Abstract

We present a novel framework for integrating prior knowledge into discriminative classifiers.
Our framework allows discriminative classifiers such as Support Vector Machines
(SVMs) to utilize prior knowledge specified in the generative setting. The dual objective of
fitting the data and respecting prior knowledge is formulated as a bilevel program, which
is solved (approximately) via iterative application of second-order cone programming. To
test our approach, we consider the problem of using WordNet (a semantic database of
the English language) to improve low-sample classification accuracy of newsgroup categorization.
WordNet is viewed as an approximate, but readily available source of background
knowledge, and our framework is capable of utilizing it in a flexible way.

1. Introduction

While SVM (Vapnik, 1995) classification accuracy on many classification tasks is often
competitive with that of human subjects, the number of training examples required to
achieve this accuracy is prohibitively large for some domains. Intelligent user interfaces,
for example, must adapt to the behavior of an individual user after a limited amount of
interaction in order to be useful. Medical systems diagnosing rare diseases have to generalize
well after seeing very few examples. Any natural language processing task that performs
processing at the level of n-grams or phrases (which is frequent in translation systems)
cannot expect to see the same sequence of words a sufficient number of times even in large
training corpora. Moreover, supervised classification methods rely on manually labeled
data, which can be expensive to obtain. Thus, it is important to improve classification
performance on very small datasets. Most classifiers are not competitive with humans
in their ability to generalize after seeing very few examples. Various techniques have been
proposed to address this problem, such as active learning (Tong & Koller, 2000b; Campbell, 
Cristianini, & Smola, 2000), hybrid generative-discriminative classification (Raina, Shen, 
Ng, & McCallum, 2003), learning-to-learn by extracting common information from related 
learning tasks (Thrun, 1995; Baxter, 2000; Fink, 2004), and using prior knowledge. 

In this work, we concentrate on improving small-sample classification accuracy with 
prior knowledge. While prior knowledge has proven useful for classification (Scholkopf, 
Simard, Vapnik, & Smola, 2002; Wu & Srihari, 2004; Fung, Mangasarian, & Shavlik, 2002; 
Epshteyn & DeJong, 2005; Sun & DeJong, 2005), it is notoriously hard to apply in practice
because there is a mismatch between the form of prior knowledge that can be employed by 
classification algorithms (either prior probabilities or explicit constraints on the hypothesis
space of the classifier) and the domain theories articulated by human experts. This is
unfortunate because various ontologies and domain theories are available in abundance, but
a considerable amount of manual effort is required to incorporate existing prior knowledge
into the native learning bias of the chosen algorithm. What would it take to apply an 
existing domain theory automatically to a classification task for which it was not specifically 
designed? In this work, we take the first steps towards answering this question. 

In our experiments, such a domain theory is exemplified by WordNet, a linguistic
database of semantic connections among English words (Miller, 1990). We apply WordNet
to a standard benchmark task of newsgroup categorization. Conceptually, a generative
model describes how the world works, while a discriminative model is inextricably linked to
a specific classification task. Thus, there is reason to believe that a generative interpretation
of a domain theory is more natural and generalizes better across different
classification tasks. In Section 2 we present empirical evidence that this is, indeed, the
case with WordNet in the context of newsgroup classification. For this reason, we interpret
the domain theory in the generative setting. However, many successful learning algorithms
(such as support vector machines) are discriminative. We present a framework which allows
the use of a generative prior in the discriminative classification setting.

Our algorithm assumes that the generative distribution of the data is given in the
Bayesian framework: $\mathrm{Prob}(data \mid model)$ and the prior $\mathrm{Prob}'(model)$ are known. However,
instead of performing Bayesian model averaging, we assume that a single model $M^*$ has
been selected a priori, and the observed data is a manifestation of that model (i.e., it is
drawn according to $\mathrm{Prob}(data \mid M^*)$). The goal of the learning algorithm is to estimate
$M^*$. This estimation is performed as a two-player sequential game of full information.
The bottom (generative) player chooses the Bayes-optimal discriminator function $f(M)$ for
the probability distribution $\mathrm{Prob}(data \mid model = M)$ (without taking the training data into
account) given the model $M$. The model $M$ is chosen by the top (discriminative) player in
such a way that its prior probability of occurring, given by $\mathrm{Prob}'(M)$, is high, and it forces
the bottom player to minimize the training-set error of its Bayes-optimal discriminator
$f(M)$. This estimation procedure gives rise to a bilevel program. We show that, while the
problem is known to be NP-hard, its approximation can be solved efficiently by iterative
application of second-order cone programming.

The only remaining issue is how to construct the generative prior $\mathrm{Prob}'(model)$ automatically
from the domain theory. We describe how to solve this problem in Section 2,
where we also argue that the generative setting is appropriate for capturing expert knowledge,
employing WordNet as an illustrative example. In Section 3, we give the necessary
preliminary information and important known facts and definitions. Our framework for
incorporating generative prior into discriminative classification is described in detail in Section
4. We demonstrate the efficacy of our approach experimentally by presenting the results
of using WordNet for newsgroup classification in Section 5. A theoretical explanation of
the improved generalization ability of our discriminative classifier constrained by generative
prior knowledge appears in Section 6. Section 7 describes related work. Section 8 concludes
the paper and outlines directions for future research.

2. Generative vs. Discriminative Interpretation of Domain Knowledge

WordNet can be viewed as a network, with nodes representing words and links representing
relationships between two words (such as synonyms, hypernyms (is-a), meronyms (part-of),
etc.). An important property of WordNet is that of semantic distance - the length
(in links) of the shortest path between any two words. Semantic distance approximately
captures the degree of semantic relatedness of two words. We set up an experiment to
evaluate the usefulness of WordNet for the task of newsgroup categorization. Each posting
was represented by a bag-of-words, with each binary feature representing the presence of
the corresponding word. The evaluation was done on pairwise classification tasks in the
following two settings:

1. The generative framework assumes that each posting $x = [x^1, \ldots, x^n]$ is generated by
a distinct probability distribution for each newsgroup. The simplest version of a
Linear Discriminant Analysis (LDA) classifier posits that $x \mid (y = -1) \sim N(\mu_1, I)$ and
$x \mid (y = 1) \sim N(\mu_2, I)$ for posting $x$ given label $y \in \{-1, 1\}$, where $I \in \mathbb{R}^{(n \times n)}$ is
the identity matrix. Classification is done by assigning the most probable label to
$x$: $y(x) = 1 \Leftrightarrow \mathrm{Prob}(x \mid 1) > \mathrm{Prob}(x \mid -1)$. It is well-known (e.g., see Duda, Hart, &
Stork, 2001) that this decision rule is equivalent to the one given by the hyperplane
$(\mu_2 - \mu_1)^T x - \frac{1}{2}(\mu_2^T \mu_2 - \mu_1^T \mu_1) > 0$. The means $\hat{\mu}_i$ are estimated via maximum
likelihood from the training data $[x_1, y_1], \ldots, [x_m, y_m]$.

2. The discriminative SVM classifier sets the separating hyperplane to directly minimize
the number of errors on the training data: $[\hat{w}, \hat{b}] = \arg\min_{w,b} \|w\|$ s.t. $y_i(w^T x_i + b) \ge 1$,
$i = 1, \ldots, m$.
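The generative decision rule in the first setting can be sketched in a few lines of Python. This is only a minimal illustration under the identity-covariance assumption stated above; the function names `lda_identity_hyperplane` and `predict` are hypothetical and do not come from the original experiments.

```python
from typing import List, Sequence, Tuple


def lda_identity_hyperplane(
    xs: Sequence[Sequence[float]], ys: Sequence[int]
) -> Tuple[List[float], float]:
    """Return (w, b) with w = mu2 - mu1 and b = -(mu2.mu2 - mu1.mu1) / 2."""
    n = len(xs[0])
    sums = {-1: [0.0] * n, 1: [0.0] * n}
    counts = {-1: 0, 1: 0}
    for x, y in zip(xs, ys):
        counts[y] += 1
        for j in range(n):
            sums[y][j] += x[j]
    # Maximum-likelihood estimates of the two class means.
    mu1 = [s / counts[-1] for s in sums[-1]]
    mu2 = [s / counts[1] for s in sums[1]]
    w = [m2 - m1 for m1, m2 in zip(mu1, mu2)]
    b = -0.5 * (sum(m * m for m in mu2) - sum(m * m for m in mu1))
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Label 1 if x lies strictly on the positive side of the hyperplane, else -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1
```

With binary bag-of-words features, each row of `xs` would be a 0/1 word-presence vector and `ys` the newsgroup label in {-1, 1}.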

Our experiment was conducted in the learning-to-learn framework (Thrun, 1995; Baxter, 
2000; Fink, 2004). In the first stage, each classifier was trained using training data from the 
training task (e.g., for classifying postings into the newsgroups 'atheism' and 'guns'). In the 
second stage, the classifier was generalized using WordNet's semantic information. In the 
third stage, the generalized classifier was applied to a different, test task (e.g., for classifying 
postings for the newsgroups 'atheism' vs. 'mideast') without seeing any data from this new 
classification task. The only way for a classifier to generalize in this setting is to use the 
original sample to acquire information about WordNet, and then exploit this information 
to help it label examples from the test sample. In learning how to perform this task, the 
system also learns how to utilize the classification knowledge implicit in WordNet. 

We now describe the second and third stages for the two classifiers in more detail: 

1. It is intuitive to interpret information embedded in WordNet as follows: if the title
of the newsgroup is 'guns', then all the words with the same semantic distance to
'gun' (e.g., 'artillery', 'shooter', and 'ordnance' with the distance of two) provide a
similar degree of classification information. To quantify this intuition, let
$l_{i,train} = [l^1_{i,train}, \ldots, l^j_{i,train}, \ldots, l^n_{i,train}]$ be the vector of semantic distances in WordNet between
each feature word $j$ and the label of each training task newsgroup $i \in \{1, 2\}$. Define

1. The standard LDA classifier assumes that $x \mid (y = -1) \sim N(\mu_1, \Sigma)$ and $x \mid (y = 1) \sim N(\mu_2, \Sigma)$ and
estimates the covariance matrix $\Sigma$ as well as the means $\mu_1, \mu_2$ from the training data. In our experiments,
we take $\Sigma = I$.

Figure 2.1: Test set accuracy as a percentage versus the number of training points for 3
different classification experiments. For each classification task, a random test
set is chosen from the full set of articles in 20 different ways. Error bars are
based on 95% confidence intervals. Panels: 1) Train: atheism vs. guns, Test: atheism
vs. mideast; 2) Train: atheism vs. guns, Test: guns vs. mideast; 3) Train: guns vs.
mideast, Test: atheism vs. mideast.



$$\chi_i(v) = \frac{\sum_{j:\, l^j_{i,train} = v} \hat{\mu}_i^j}{\left|\{j : l^j_{i,train} = v\}\right|}, \quad i = 1, 2,$$

where $|\cdot|$ denotes cardinality of a set. $\chi_i$ compresses
information in $\hat{\mu}_i$ based on the assumption that words equidistant from the newsgroup
label are equally likely to appear in a posting from that newsgroup. To test the
performance of this compressed classifier on a new task with semantic distances given
by $l_{i,test}$, the generative distributions are reconstructed via $\mu_i^j := \chi_i(l^j_{i,test})$. Notice
that if the classifier is trained and tested on the same task, applying the function $\chi_i$
is equivalent to averaging the components of the means of the generative distribution
corresponding to the equivalence classes of words equidistant from the label. If the
classifier is tested on a different classification task, the reconstruction process reassigns
the averages based on the semantic distances to the new labels.

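2. It is less intuitive to interpret WordNet in a discriminative setting. One possible
interpretation is that coefficients $w^j$ of the separating hyperplane are governed by
semantic distances to labels, as captured by the compression function

$$\chi'(v_1, v_2) = \frac{\sum_{j:\, l^j_{1,train} = v_1,\; l^j_{2,train} = v_2} w^j}{\left|\{j : l^j_{1,train} = v_1,\; l^j_{2,train} = v_2\}\right|}$$

and reconstructed via $w^j := \chi'(l^j_{1,test}, l^j_{2,test})$.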
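The compression and reconstruction of the generative means described in item 1 can be illustrated with the following Python sketch. It assumes the averaging interpretation given in the text (words equidistant from the label share one averaged mean component). The names `compress` and `reconstruct`, and the `default` value used for distances never seen in the training task, are hypothetical choices made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def compress(mu_hat: Sequence[float], dist_train: Sequence[int]) -> Dict[int, float]:
    """chi(v): average of mu_hat[j] over all words j at semantic distance v."""
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for m, v in zip(mu_hat, dist_train):
        totals[v] += m
        counts[v] += 1
    return {v: totals[v] / counts[v] for v in totals}


def reconstruct(
    chi: Dict[int, float], dist_test: Sequence[int], default: float = 0.0
) -> List[float]:
    """mu[j] := chi(distance of word j to the test-task label).

    `default` (an assumption of this sketch) is used for unseen distances.
    """
    return [chi.get(v, default) for v in dist_test]
```

When `dist_test` equals `dist_train`, `reconstruct(compress(mu_hat, dist_train), dist_train)` simply replaces each mean component by the average over its equivalence class, as noted in the text.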



Note that both the LDA generative classifier and the SVM discriminative classifier have 
the same hypothesis space of separating hyperplanes. The resulting test set classification 
accuracy for each classifier for a few classification tasks from the 20-newsgroup dataset
(Blake & Merz, 1998) is presented in Figure 2.1. The x-axis of each graph represents the
size of the training task sample, and the y-axis - the classifier's performance on the test
classification task. The generative classifier consistently outperforms the discriminative 
classifier. It converges much faster, and on two out of three tasks the discriminative classifier 
is not able to use prior knowledge nearly as effectively as the generative classifier even after 
seeing 90% of all of the available training data. The generative classifier is also more 
consistent in its performance - note that its error bars are much smaller than those of 
the discriminative classifier. The results clearly show the potential of using background 
knowledge as a vehicle for sharing information between tasks. But the effective sharing is 
contingent on an appropriate task decomposition, here supplied by the tuned generative 
model. 

The evidence in Figure 2.1 seemingly contradicts the conventional wisdom that discriminative
training outperforms generative for sufficiently large training samples. However, our
experiment evaluates the two frameworks in the context of using an ontology to transfer
information between learning tasks. This was never done before. The experiment demonstrates
that the interpretation of semantic distance in WordNet is more intuitive in the
generative classification setting, probably because it better reflects the human intuitions
behind WordNet.

However, our goal is not just to construct a classifier that performs well without seeing 
any examples of the test classification task. We also want a classifier that improves its 
behavior as it sees new labeled data from the test classification task. This presents us 
with a problem: one of the best-performing classifiers (and certainly the best on the text 
classification task according to the study by Joachims, 1998) is SVM, a discriminative 
classifier. Therefore, in the rest of this work, we focus on incorporating generative prior 
knowledge into the discriminative classification framework of support vector machines. 

3. Preliminaries

It has been observed that constraints on the probability measure of a half-space can be
captured by second-order cone constraints for Gaussian distributions (see, e.g., the tutorial
by Lobo, Vandenberghe, Boyd, & Lebret, 1998). This allows for efficient processing of such
constraints within the framework of second-order cone programming (SOCP). We intend
to model prior knowledge with elliptical distributions, a family of probability distributions
which generalizes Gaussians. In what follows, we give a brief overview of second-order
cone programming and its relationship to constraints imposed on the Gaussian probability
distribution. We also note that it is possible to extend the argument presented by Lobo et
al. (1998) to elliptical distributions.

A second-order cone program is a mathematical program of the form:

$$\min_x \; v^T x \qquad (3.1)$$
$$\text{s.t. } \|A_i x + b_i\| \le c_i^T x + d_i, \quad i = 1, \ldots, N \qquad (3.2)$$

where $x \in \mathbb{R}^n$ is the optimization variable and $v \in \mathbb{R}^n$, $A_i \in \mathbb{R}^{(k_i \times n)}$, $b_i \in \mathbb{R}^{k_i}$, $c_i \in \mathbb{R}^n$,
$d_i \in \mathbb{R}$ are problem parameters ($\|\cdot\|$ represents the usual L2-norm in this paper). SOCPs
can be solved efficiently with interior-point methods, as described by Lobo et al. (1998) in
a tutorial which contains an excellent overview of the theory and applications of SOCP.

We use the elliptical distribution to model the distribution of the data a priori. Elliptical
distributions are distributions with ellipsoidally-shaped equiprobable contours. The density
function of the $n$-variate elliptical distribution has the form
$f_{\mu,\Sigma,g}(x) = c\,(\det \Sigma)^{-1/2}\, g\big((x - \mu)^T \Sigma^{-1} (x - \mu)\big)$, where $x \in \mathbb{R}^n$ is the random variable, $\mu \in \mathbb{R}^n$ is the location parameter,
$\Sigma \in \mathbb{R}^{(n \times n)}$ is a positive definite $(n \times n)$-matrix representing the scale parameter, function
$g(\cdot)$ is the density generator, and $c$ is the normalizing constant. We will use the notation
$X \sim E(\mu, \Sigma, g)$ to denote that the random variable $X$ has an elliptical distribution
with parameters $\mu, \Sigma, g$. Choosing appropriate density generator functions $g$, the Gaussian
distribution, the Student-t distribution, the Cauchy distribution, the Laplace distribution,
and the logistic distribution can be seen as special cases of the elliptical distribution. Using
an elliptical distribution relaxes the restrictive assumptions the user has to make when
imposing a Gaussian prior, while keeping many desirable properties of Gaussians, such as:

1. If $X \sim E(\mu, \Sigma, g)$, $A \in \mathbb{R}^{(k \times n)}$, and $B \in \mathbb{R}^k$, then $AX + B \sim E(A\mu + B, A \Sigma A^T, g)$.

2. If $X \sim E(\mu, \Sigma, g)$, then $E(X) = \mu$.

3. If $X \sim E(\mu, \Sigma, g)$, then $Var(X) = \alpha_g \Sigma$, where $\alpha_g$ is a constant that depends on the
density generator $g$.

The following proposition shows that for elliptical distributions, the constraint $\mathrm{Prob}(w^T x + b \ge 0) \le \eta$
(i.e., the probability that $X$ takes values in the half-space $\{w^T x + b \ge 0\}$ is at most
$\eta$) is equivalent to a second-order cone constraint for $\eta \le \frac{1}{2}$:

Proposition 3.1. If $X \sim E(\mu, \Sigma, g)$, $\mathrm{Prob}(w^T x + b \ge 0) \le \eta \le \frac{1}{2}$ is equivalent to
$-(w^T \mu + b)/\beta_{g,\eta} \ge \|\Sigma^{1/2} w\|$, where $\beta_{g,\eta}$ is a constant which only depends on $g$ and $\eta$.

Proof. The proof is identical to the one given by Lobo et al. (1998) and Lanckriet et al. (2001)
for Gaussian distributions and is provided here for completeness:

Assume $\mathrm{Prob}(w^T x + b \ge 0) \le \eta$. (3.3)

Let $u = w^T x + b$. Let $\bar{u}$ denote the location (mean) of $u$, and $\sigma$ denote its scale. Then the constraint
(3.3) can be written as

$\mathrm{Prob}\left(\frac{u - \bar{u}}{\sigma} \ge -\frac{\bar{u}}{\sigma}\right) \le \eta$. (3.4)

By the properties of elliptical distributions, $\bar{u} = w^T \mu + b$, $\sigma = \|\Sigma^{1/2} w\|$, and $\frac{u - \bar{u}}{\sigma} \sim E(0, 1, g)$.
Thus, statement (3.4) above can be expressed as $\mathrm{Prob}_{X \sim E(0,1,g)}\left(X \ge -\frac{w^T \mu + b}{\|\Sigma^{1/2} w\|}\right) \le \eta$,
which is equivalent to $-\frac{w^T \mu + b}{\|\Sigma^{1/2} w\|} \ge \Phi^{-1}(\eta)$, where $\Phi(z) = \mathrm{Prob}_{X \sim E(0,1,g)}(X \ge z)$. The
proposition follows with $\beta_{g,\eta} = \Phi^{-1}(\eta)$. □
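For the Gaussian special case, the equivalence in Proposition 3.1 can be checked numerically. The sketch below assumes a diagonal covariance matrix (passed as its diagonal) and takes $\beta$ to be the upper-tail quantile of the standard normal distribution; the function names are hypothetical, and only the standard-library `statistics.NormalDist` class is used.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def _mean_and_scale(
    w: Sequence[float], b: float, mu: Sequence[float], sigma_diag: Sequence[float]
) -> Tuple[float, float]:
    """Mean w.mu + b and scale ||Sigma^{1/2} w|| of u = w.x + b (diagonal Sigma)."""
    mean = sum(wj * mj for wj, mj in zip(w, mu)) + b
    scale = math.sqrt(sum(sj * wj * wj for wj, sj in zip(w, sigma_diag)))
    return mean, scale


def halfspace_prob(w, b, mu, sigma_diag) -> float:
    """Prob(w.x + b >= 0) for x ~ N(mu, diag(sigma_diag)); requires w != 0."""
    mean, scale = _mean_and_scale(w, b, mu, sigma_diag)
    return 1.0 - NormalDist(mean, scale).cdf(0.0)


def cone_constraint_holds(w, b, mu, sigma_diag, eta: float) -> bool:
    """Check -(w.mu + b) >= beta * ||Sigma^{1/2} w|| with beta = upper-tail quantile."""
    mean, scale = _mean_and_scale(w, b, mu, sigma_diag)
    beta = NormalDist().inv_cdf(1.0 - eta)  # positive when eta < 1/2
    return -mean >= beta * scale
```

For any choice of inputs with $\eta < 1/2$, `halfspace_prob(...) <= eta` and `cone_constraint_holds(..., eta)` should agree, which is the content of the proposition in the Gaussian case.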

Proposition 3.2. For any monotonically decreasing $g$, $\mathrm{Prob}_{X \sim E(\mu,\Sigma,g)}(x) \ge \delta$ is equivalent
to $\|\Sigma^{-1/2}(x - \mu)\| \le \varphi_{g,c,\Sigma,\delta}$, where $\varphi_{g,c,\Sigma,\delta} = \sqrt{g^{-1}\big(\delta\,(\det \Sigma)^{1/2} / c\big)}$ is a constant which only depends on
$g$, $c$, $\Sigma$, and $\delta$.

Proof. Follows directly from the definition of $\mathrm{Prob}_{X \sim E(\mu,\Sigma,g)}(x)$. □

4. Generative Prior via Bilevel Programming

We deal with the binary classification task: the classifier is a function $f(x)$ which maps
instances $x \in \mathbb{R}^n$ to labels $y \in \{-1, 1\}$. In the generative setting, the probability densities
$\mathrm{Prob}(x \mid y = -1; \mu_1)$ and $\mathrm{Prob}(x \mid y = 1; \mu_2)$ parameterized by $\mu = [\mu_1, \mu_2]$ are provided (or
estimated from the data), along with the prior probabilities on class labels $\Pi(y = -1)$ and
$\Pi(y = 1)$, and the Bayes optimal decision rule is given by the classifier

$$f(x \mid \mu) = \mathrm{sign}\big(\mathrm{Prob}(x \mid y = 1; \mu_2)\,\Pi(y = 1) - \mathrm{Prob}(x \mid y = -1; \mu_1)\,\Pi(y = -1)\big),$$

where $\mathrm{sign}(x) := 1$ if $x \ge 0$ and $-1$ otherwise. In LDA, for instance, the parameters $\mu_1$ and
$\mu_2$ are the means of the two Gaussian distributions generating the data given each label.

Informally, our approach to incorporating prior knowledge is straightforward: we assume
a two-level hierarchical generative probability distribution model. The low-level probability
distribution of the data given the label $\mathrm{Prob}(x \mid y; \mu)$ is parameterized by $\mu$, which, in turn,
has a known probability distribution $\mathrm{Prob}'(\mu)$. The goal of the classifier is to estimate the
values of the parameter vector $\mu$ from the training set of labeled points $[x_1, y_1], \ldots, [x_m, y_m]$.
This estimation is performed as a two-player sequential game of full information. The
bottom (generative) player, given $\mu$, selects the Bayes optimal decision rule $f(x \mid \mu)$. The
top (discriminative) player selects the value of $\mu$ which has a high probability of occurring
(according to $\mathrm{Prob}'(\mu)$) and which will force the bottom player to select the decision rule
which minimizes the discriminative error on the training set. We now give a more formal
specification of this training problem and formulate it as a bilevel program. Some of the
assumptions are subsequently relaxed to enforce both tractability and flexibility.

We use an elliptical distribution $E(\mu_1, \Sigma_1, g)$ to model $X \mid y = -1$, and another elliptical
distribution $E(\mu_2, \Sigma_2, g)$ to model $X \mid y = 1$. If the parameters $\mu_i, \Sigma_i$, $i = 1, 2$ are known,
the Bayes optimal decision rule restricted to the class of linear classifiers$^2$ of the form
$f_{w,b}(x) = \mathrm{sign}(w^T x + b)$ is given by the $f(x)$ which minimizes the probability of error among all
linear discriminants: $\mathrm{Prob}(error) = \mathrm{Prob}(w^T x + b \ge 0 \mid y = -1)\,\Pi(y = -1) + \mathrm{Prob}(w^T x + b \le 0 \mid y = 1)\,\Pi(y = 1)$
$= \frac{1}{2}\big(\mathrm{Prob}_{X \sim E(\mu_1,\Sigma_1,g)}(w^T x + b \ge 0) + \mathrm{Prob}_{X \sim E(\mu_2,\Sigma_2,g)}(w^T x + b \le 0)\big)$,
assuming equal prior probabilities for both classes. We now model the uncertainty in the
means of the elliptical distributions $\mu_i$, $i = 1, 2$ by imposing elliptical prior distributions on
the locations of the means: $\mu_i \sim E(t_i, \Omega_i, g)$, $i = 1, 2$. In addition, to ensure the optimization
problem is well-defined, we maximize the margin of the hyperplane subject to the imposed
generative probability constraints:

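$$\min_{\mu_1, \mu_2, w, b} \|w\| \qquad (4.1)$$
$$\text{s.t. } y_i(w^T x_i + b) \ge 1, \quad i = 1, \ldots, m \qquad (4.2)$$
$$\mathrm{Prob}_{\mu_i \sim E(t_i, \Omega_i, g)}(\mu_i) \ge \delta, \quad i = 1, 2 \qquad (4.3)$$
$$[w, b] \text{ solves } \min_{w,b} \big[\mathrm{Prob}_{X \sim E(\mu_1,\Sigma_1,g)}(w^T x + b \ge 0) + \mathrm{Prob}_{X \sim E(\mu_2,\Sigma_2,g)}(w^T x + b \le 0)\big] \qquad (4.4)$$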

This is a bilevel mathematical program (i.e., an optimization problem in which the
constraint region is implicitly defined by another optimization problem), which is strongly
NP-hard even when all the constraints and both objectives are linear (Hansen, Jaumard,
& Savard, 1992). However, we show that it is possible to solve a reasonable approximation
of this problem efficiently with several iterations of second-order cone programming.
First, we relax the second-level minimization (4.4) by breaking it up into two constraints:
$\mathrm{Prob}_{X \sim E(\mu_1,\Sigma_1,g)}(w^T x + b \ge 0) \le \eta$ and $\mathrm{Prob}_{X \sim E(\mu_2,\Sigma_2,g)}(w^T x + b \le 0) \le \eta$. Thus, instead
of looking for the Bayes optimal decision boundary, the algorithm looks for a decision
boundary with low probability of error, where low error is quantified by the choice of $\eta$.

2. A decision rule restricted to some class of classifiers H is optimal if its probability of error is no larger
than that of any other classifier in H (Tong & Koller, 2000a).

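Propositions 3.1 and 3.2 enable us to rewrite the optimization problem resulting from
this relaxation as follows:

$$\min_{\mu_1, \mu_2, w, b} \|w\| \qquad (4.5)$$
$$\text{s.t. } y_i(w^T x_i + b) \ge 1, \quad i = 1, \ldots, m \qquad (4.6)$$
$$\mathrm{Prob}_{\mu_i \sim E(t_i, \Omega_i, g)}(\mu_i) \ge \delta,\ i = 1, 2 \;\Leftrightarrow\; \|\Omega_i^{-1/2}(\mu_i - t_i)\| \le \varphi, \quad i = 1, 2 \qquad (4.7)$$
$$\mathrm{Prob}_{X \sim E(\mu_1,\Sigma_1,g)}(w^T x + b \ge 0) \le \eta \;\Leftrightarrow\; -\frac{w^T \mu_1 + b}{\|\Sigma_1^{1/2} w\|} \ge \beta \qquad (4.8)$$
$$\mathrm{Prob}_{X \sim E(\mu_2,\Sigma_2,g)}(w^T x + b \le 0) \le \eta \;\Leftrightarrow\; \frac{w^T \mu_2 + b}{\|\Sigma_2^{1/2} w\|} \ge \beta \qquad (4.9)$$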
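To make the roles of the constraints (4.6)-(4.9) concrete, the following Python sketch checks whether a candidate point satisfies all of them. It is an illustration only: it assumes diagonal $\Omega_i$ and $\Sigma_i$ (passed as lists holding their diagonals), and the helper names and the tolerance `tol` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def scaled_norm(w: Vector, sigma_diag: Vector) -> float:
    """||Sigma^{1/2} w|| for a diagonal Sigma given by its diagonal."""
    return math.sqrt(sum(s * wj * wj for s, wj in zip(sigma_diag, w)))


def hyperprior_distance(mu: Vector, t: Vector, omega_diag: Vector) -> float:
    """||Omega^{-1/2} (mu - t)|| for a diagonal, positive Omega."""
    return math.sqrt(sum((m - c) ** 2 / o for m, c, o in zip(mu, t, omega_diag)))


def is_feasible(w, b, xs, ys, mu, t, omega_diag, sigma_diag, beta, phi, tol=1e-9) -> bool:
    """mu, t, omega_diag, sigma_diag are pairs: index 0 is class -1, index 1 is class +1."""
    # (4.6): every training point is on the correct side with margin at least 1.
    margin_ok = all(y * (dot(w, x) + b) >= 1 - tol for x, y in zip(xs, ys))
    # (4.7): each mean stays within the hyperprior ellipsoid of radius phi.
    prior_ok = all(
        hyperprior_distance(mu[i], t[i], omega_diag[i]) <= phi + tol for i in (0, 1)
    )
    # (4.8) and (4.9), multiplied through by the positive norm to avoid division.
    neg_ok = -(dot(w, mu[0]) + b) >= beta * scaled_norm(w, sigma_diag[0]) - tol
    pos_ok = dot(w, mu[1]) + b >= beta * scaled_norm(w, sigma_diag[1]) - tol
    return margin_ok and prior_ok and neg_ok and pos_ok
```

A full solver would hand these same constraints to an SOCP package; the check above is only meant to show which quantities each constraint compares.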



Notice that the form of this program does not depend on the generator function $g$ of the
elliptical distribution - only constants $\beta$ and $\varphi$ depend on it. $\varphi$ defines how far the system
is willing to deviate from the prior in its choice of a generative model, and $\beta$ bounds the
tail probabilities of error (Type I and Type II) which the system will tolerate assuming its
chosen generative model is correct. These constants depend both on the specific generator
$g$ and the amount of error the user is willing to tolerate. In our experiments, we select
the values of these constants to optimize performance. Unless the user wants to control
the probability bounds through these constants, it is sufficient to assume a priori only that
probability distributions (both prior and hyper-prior) are elliptical, without making any
further commitments.

Our algorithm solves the above problem by repeating the following two steps: 

1. Fix the top-level optimization parameters $\mu_1$ and $\mu_2$. This step combines the objectives
of maximizing the margin of the classifier on the training data and ensuring that
the decision boundary is (approximately) Bayes optimal with respect to the given
generative probability densities specified by $\mu_1, \mu_2$.

2. Fix the bottom-level optimization parameters $w, b$. Expand the feasible region of the
program in step 1 as a function of $\mu_1, \mu_2$. This step fixes the decision boundary and
pushes the means of the generative distribution as far away from the boundary as the
constraint (4.7) will allow.

The steps are repeated until convergence (in practice, convergence is detected when the 
optimization parameters do not change appreciably from one iteration to the next). Each 
step of the algorithm can be formulated as a second-order cone program:

Step 1. Fix $\mu_1$ and $\mu_2$. Removing unnecessary constraints from the mathematical
program above and pushing the objective into constraints, we get the following SOCP:

$$\min_{w, b} \rho \qquad (4.10)$$
$$\text{s.t. } \rho \ge \|w\| \qquad (4.11)$$
$$y_i(w^T x_i + b) \ge 1, \quad i = 1, \ldots, m \qquad (4.12)$$
$$-\frac{w^T \mu_1 + b}{\|\Sigma_1^{1/2} w\|} \ge \beta \qquad (4.13)$$
$$\frac{w^T \mu_2 + b}{\|\Sigma_2^{1/2} w\|} \ge \beta \qquad (4.14)$$



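Step 2. Fix $w, b$ and expand the span of the feasible region, as measured by
$\frac{w^T \mu_2 + b}{\|\Sigma_2^{1/2} w\|} - \frac{w^T \mu_1 + b}{\|\Sigma_1^{1/2} w\|}$. Removing unnecessary constraints, we get:

$$\max_{\mu_1, \mu_2} \; \frac{w^T \mu_2 + b}{\|\Sigma_2^{1/2} w\|} - \frac{w^T \mu_1 + b}{\|\Sigma_1^{1/2} w\|} \qquad (4.15)$$
$$\text{s.t. } \|\Omega_i^{-1/2}(\mu_i - t_i)\| \le \varphi, \quad i = 1, 2 \qquad (4.16)$$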
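Because $w$ and $b$ are fixed in Step 2, the objective is linear in each mean and each constraint is an ellipsoid, so for $w \neq 0$ the maximizer has the standard closed form $\mu_2 = t_2 + \varphi\,\Omega_2 w / \|\Omega_2^{1/2} w\|$ and $\mu_1 = t_1 - \varphi\,\Omega_1 w / \|\Omega_1^{1/2} w\|$. The sketch below illustrates this under the assumption of diagonal $\Omega_i$. It is not the procedure described in the text, which formulates the step as an SOCP, and the name `step2_update` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def step2_update(
    w: Sequence[float],
    t1: Sequence[float],
    t2: Sequence[float],
    omega1_diag: Sequence[float],
    omega2_diag: Sequence[float],
    phi: float,
) -> Tuple[List[float], List[float]]:
    """Push mu1 against w and mu2 along w, to the boundary of each ellipsoid."""

    def shift(omega_diag: Sequence[float]) -> List[float]:
        # phi * Omega w / ||Omega^{1/2} w|| for a diagonal Omega; requires w != 0.
        norm = math.sqrt(sum(o * wj * wj for o, wj in zip(omega_diag, w)))
        return [phi * o * wj / norm for o, wj in zip(omega_diag, w)]

    s1, s2 = shift(omega1_diag), shift(omega2_diag)
    mu1 = [t - s for t, s in zip(t1, s1)]  # class -1 mean moves to the negative side
    mu2 = [t + s for t, s in zip(t2, s2)]  # class +1 mean moves to the positive side
    return mu1, mu2
```

The positive scaling factors $1/\|\Sigma_i^{1/2} w\|$ in (4.15) do not change the direction of the optimum, which is why they do not appear in the update.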



The behavior of the algorithm is illustrated in Figure 4.1. 
The following theorems state that the algorithm converges. 

Theorem 4.1. Suppose that the algorithm produces a sequence of iterates
$\mu_1^{(t)}, \mu_2^{(t)}, w^{(t)}, b^{(t)}$, and the quality of each iterate is evaluated by its margin $\|w^{(t)}\|$.
This evaluation function converges.

Proof. Let $\mu_1^{(t)}, \mu_2^{(t)}$ be the values of the prior location parameters, and $w^{(t)}, b^{(t)}$ be the
minimum error hyperplane the algorithm finds at the end of the $t$-th step. At the end of
the $(t+1)$-st step, $w^{(t)}, b^{(t)}$ is still in the feasible region of the $t$-th step SOCP. This
is true because the function

$$f\left(\frac{w^{(t)T} \mu_2 + b^{(t)}}{\|\Sigma_2^{1/2} w^{(t)}\|},\; -\frac{w^{(t)T} \mu_1 + b^{(t)}}{\|\Sigma_1^{1/2} w^{(t)}\|}\right) = \frac{w^{(t)T} \mu_2 + b^{(t)}}{\|\Sigma_2^{1/2} w^{(t)}\|} - \frac{w^{(t)T} \mu_1 + b^{(t)}}{\|\Sigma_1^{1/2} w^{(t)}\|}$$

is monotonically increasing in each one of its arguments when the other argument is fixed,
and fixing $\mu_1$ (or $\mu_2$) fixes exactly one argument. If the solution $\mu_1^{(t+1)}, \mu_2^{(t+1)}$ at the end
of the $(t+1)$-st step were such that $\frac{w^{(t)T} \mu_2^{(t+1)} + b^{(t)}}{\|\Sigma_2^{1/2} w^{(t)}\|} < \beta$, then $f$ could be increased by
fixing $\mu_1^{(t+1)}$ and using the value of $\mu_2^{(t)}$ from the beginning of the step which ensures that
$\frac{w^{(t)T} \mu_2^{(t)} + b^{(t)}}{\|\Sigma_2^{1/2} w^{(t)}\|} \ge \beta$, which contradicts the observation that $f$ is maximized at the end of
the second step. The same contradiction is reached if $-\frac{w^{(t)T} \mu_1^{(t+1)} + b^{(t)}}{\|\Sigma_1^{1/2} w^{(t)}\|} < \beta$. Since the
minimum error hyperplane from the previous iteration is in the feasible region at the start
of the next iteration, the objective $\|w^{(t)}\|$ must decrease monotonically from one iteration
to the next. Since it is bounded below by zero, the algorithm converges. □

Figure 4.1: Steps of the iterative (hard-margin) SOCP procedure:
(The region where the hyperprior probability is larger than $\delta$ is shaded for each prior
distribution. The covariance matrices are represented by equiprobable elliptical contours.
In this example, the covariance matrices of the hyperprior and the prior distributions are
multiples of each other. Data points from two different classes are represented by diamonds
and squares.)

1. Data, prior, and hyperprior before the algorithm is executed. 

2. Hyperplane discriminator at the end of step 1, iteration 1 

3. Priors at the end of step 2, iteration 1 

4. Hyperplane discriminator at the end of step 2, iteration 2 

The algorithm converges at the end of step 2 for this problem (step 3 does not move the 
hyperplane) . 



In addition to the convergence of the objective function, the accumulation points of the 
sequence of iterates can be characterized by the following theorem: 

Theorem 4.2. The accumulation points of the sequence $\mu_1^{(t)}, \mu_2^{(t)}, w^{(t)}, b^{(t)}$ (i.e., limiting
points of its convergent subsequences) have no feasible descent directions for the original
optimization problem given by (4.5)-(4.9).

Proof. See Appendix A. □

If a point has no feasible descent directions, then any sufficiently small step along any
directional vector will either increase the objective function, leave it unchanged, or take the
algorithm outside of the feasible region. The set of points with no feasible descent directions
is a superset of the set of local minima. Hence, convergence to such a point is a somewhat
weaker result than convergence to a local minimum.

In practice, we observed rapid convergence usually within 2-4 iterations. 

Finally, we may want to relax the strict assumptions of the correctness of the prior/linear 
separability of the data by introducing slack variables into the optimization problem above. 
This results in the following program: 



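$$\min_{\mu_1, \mu_2, w, b, \xi_i, \zeta_1, \zeta_2, \nu_1, \nu_2} \; \|w\| + C_1 \sum_{i=1}^{m} \xi_i + C_2(\zeta_1 + \zeta_2) + C_3(\nu_1 + \nu_2) \qquad (4.17)$$
$$\text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, m \qquad (4.18)$$
$$\|\Omega_i^{-1/2}(\mu_i - t_i)\| \le \varphi + \nu_i, \quad i = 1, 2 \qquad (4.19)$$
$$-\frac{w^T \mu_1 + b}{\|\Sigma_1^{1/2} w\|} \ge \beta - \zeta_1 \qquad (4.20)$$
$$\frac{w^T \mu_2 + b}{\|\Sigma_2^{1/2} w\|} \ge \beta - \zeta_2 \qquad (4.21)$$
$$\xi_i \ge 0, \quad i = 1, \ldots, m \qquad (4.22)$$
$$\nu_i \ge 0, \quad i = 1, 2 \qquad (4.23)$$
$$\zeta_i \ge 0, \quad i = 1, 2 \qquad (4.24)$$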


As before, this problem can be solved with the two-step iterative SOCP procedure. 

Imposing the generative prior with soft constraints ensures that, as the amount of training 
data increases, the data overwhelms the prior and the algorithm converges to the maximum- 
margin separating hyperplane. 



5. Experiments 

The experiments were designed both to demonstrate the usefulness of the proposed approach 
for incorporation of generative prior into discriminative classification, and to address a 
broader question by showing that it is possible to use an existing domain theory to aid in 
a classification task for which it was not specifically designed. In order to construct the 
generative prior, the generative LDA classifier was trained on the data from the training 
classification task to estimate the Gaussian location parameters $\hat{\mu}_i$, $i = 1, 2$, as described
in Section 2. The compression function $\chi_i(v)$ is subsequently computed (also as described
in Section 2), and is used to set the hyperprior parameters via $t_i^j := \chi_i(l^j_{i,test})$, $i = 1, 2$.
In order to apply a domain theory effectively to the task for which it was not specifically 
designed, the algorithm must be able to estimate its confidence in the decomposition of the 
domain theory with respect to this new learning task. In order to model the uncertainty in 
applicability of WordNet to newsgroup categorization, our system estimated its confidence in 
homogeneity of equivalence classes of semantic distances by computing the variance of each
random variable $\chi_i(v)$ as follows: $\sigma_i(v) = \frac{\sum_{j:\, l^j_{i,train} = v} (\hat{\mu}_i^j - \chi_i(v))^2}{\left|\{j : l^j_{i,train} = v\}\right|}$. The hyperprior confidence
matrices $\Omega_i$, $i = 1, 2$ were then reconstructed with respect to the test task semantic distances
$l_{i,test}$, $i = 1, 2$ as follows: $[\Omega_i]_{j,k} := \sigma_i(l^j_{i,test})$ if $k = j$, and $0$ if $k \ne j$. Identity matrices were used as
covariance matrices of the lower-level prior: $\Sigma_1 = \Sigma_2 := I$. The rest of the parameters
were set as follows: $\beta := 0.2$, $\varphi := 0.01$, $C_1 = C_2 := 1$, $C_3 := \infty$. These constants were
chosen manually to optimize performance on Experiment 1 (for the training task: atheism
vs. guns, test task: guns vs. mideast, see Figure 5.2) without observing any data from any
other classification tasks.

Figure 5.1: Performance of the bilevel discriminative classifier constrained by generative
prior knowledge versus performance of SVM. Each point represents a unique
pair of training/test tasks, with 0.5% of the test task data used for training.
The results are averaged over 100 experiments.
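The construction of the diagonal hyperprior confidence matrices can be sketched as follows. The code assumes that the confidence for each semantic distance is the within-class variance of the estimated mean components, which is one reading of the description above. The names `class_variances` and `hyperprior_diagonal`, and the `default` value for distances that never occur in the training task, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def class_variances(mu_hat: Sequence[float], dist_train: Sequence[int]) -> Dict[int, float]:
    """sigma(v): variance of mu_hat[j] over the words j at semantic distance v."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for m, v in zip(mu_hat, dist_train):
        groups[v].append(m)
    variances: Dict[int, float] = {}
    for v, values in groups.items():
        mean = sum(values) / len(values)  # this is chi(v)
        variances[v] = sum((x - mean) ** 2 for x in values) / len(values)
    return variances


def hyperprior_diagonal(
    sigma: Dict[int, float], dist_test: Sequence[int], default: float = 1.0
) -> List[float]:
    """Diagonal of Omega: entry j is sigma(distance of word j to the test label).

    `default` (an assumption of this sketch) covers distances unseen in training.
    """
    return [sigma.get(v, default) for v in dist_test]
```

An equivalence class containing a single word yields a zero variance and hence a singular $\Omega_i$; a practical implementation would need some floor on these entries, which this sketch does not attempt to choose.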

The resulting classifier was evaluated in different experimental setups (with different 
pairs of newsgroups chosen for the training and the test tasks) to justify the following 
claims: 

1. The bilevel generative/discriminative classifier with WordNet-derived prior knowledge
has good low-sample performance, showing both the feasibility of automatically
interpreting the knowledge embedded in WordNet and the efficacy of the proposed
algorithm.

2. The bilevel classifier's performance improves with increasing training sample size. 

3. Integrating generative prior into the discriminative classification framework results 
in better performance than integrating the same prior directly into the generative 
framework via Bayes' rule. 
4. The bilevel classifier outperforms a state-of-the-art discriminative multitask classifier
proposed by Evgeniou and Pontil (2004) by taking advantage of the WordNet domain
theory.

In order to evaluate the low-sample performance of the proposed classifier, four newsgroups 
from the 20-newsgroup dataset were selected for experiments: atheism, guns, middle east, 
and auto. Using these categories, thirty experimental setups were created for all the possible 
ways of assigning newsgroups to training and test tasks (with a pair of newsgroups assigned 
to each task, under the constraint that the training and test pairs cannot be identical).[3] In
each experiment, we compared the following two classifiers: 

1. Our bilevel generative-discriminative classifier with the knowledge transfer functions
$\chi_i(v)$, $\sigma_i(v)$, $i = 1, 2$, learned from the labeled training data provided for the training
task (using 90% of all the available data for that task). The resulting prior was
subsequently introduced into the discriminative classification framework via our
approximate bilevel programming approach.

2. A vanilla SVM classifier which minimizes the regularized empirical risk: 

$$\min_{w,b,\xi} \; \sum_{i=1}^{m} \xi_i + C_1 \|w\|^2 \qquad (5.1)$$

$$\text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, m \qquad (5.2)$$
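
To make the objective concrete, the following minimal Python sketch evaluates (5.1) for a fixed hyperplane. It assumes, as in the standard soft-margin SVM, that the slacks are nonnegative, so that each $\xi_i$ takes its smallest feasible value $\max(0, 1 - y_i(w^T x_i + b))$. The function name `svm_objective` and the toy data are illustrative only and are not taken from the experiments.

```python
def svm_objective(w, b, X, y, C1):
    """Value of objective (5.1) for a fixed hyperplane (w, b).

    Assumption: slacks are nonnegative, so the smallest slack that
    satisfies constraint (5.2) for example i is max(0, 1 - margin_i).
    """
    total_slack = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        total_slack += max(0.0, 1.0 - margin)
    squared_norm = sum(w_j * w_j for w_j in w)
    return total_slack + C1 * squared_norm


# Toy data: two points on opposite sides of the vertical axis.
X = [[2.0, 0.0], [-2.0, 0.0]]
y = [1, -1]
wide_margin = svm_objective([1.0, 0.0], 0.0, X, y, C1=1.0)     # no slack needed
small_weights = svm_objective([0.25, 0.0], 0.0, X, y, C1=1.0)  # pays slack instead
```

The two calls illustrate the tradeoff in (5.1): shrinking $\|w\|$ lowers the regularization term but forces positive slack on both examples.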



Both classifiers were trained on 0.5% of all the available data from the test classification 
task,[4] and evaluated on the remaining 99.5% of the test task data. The results, averaged
over one hundred randomly selected datasets, are presented in Figure 5.1, which shows the 
plot of the accuracy of the bilevel generative/discriminative classifier versus the accuracy 
of the SVM classifier, evaluated in each of the thirty experimental setups. All the points 
lie above the 45° line, indicating improvement in performance due to incorporation of prior
knowledge via the bilevel programming framework. The amount of improvement ranges 
from 10% to 30%, with all of the improvements being statistically significant at the 5% 
level. 
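
The count of thirty experimental setups can be checked by enumeration. The sketch below is only an illustration of the counting argument (the helper name `experimental_setups` is hypothetical): it forms all unordered newsgroup pairs and then all ordered (training pair, test pair) combinations in which the two pairs differ.

```python
from itertools import combinations, permutations


def experimental_setups(newsgroups):
    """All (training_pair, test_pair) setups with non-identical pairs."""
    pairs = list(combinations(newsgroups, 2))  # unordered pairs of newsgroups
    return list(permutations(pairs, 2))        # ordered, so training != test


four_group_setups = experimental_setups(["atheism", "guns", "mideast", "auto"])
three_group_setups = experimental_setups(["atheism", "guns", "mideast"])
print(len(four_group_setups), len(three_group_setups))
```

Four newsgroups give 6 pairs and hence 6 × 5 = 30 setups; restricting to three newsgroups gives 3 pairs and 3 × 2 = 6 setups.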

The next experiment was conducted to evaluate the effect of increasing training data 
(from the test task) on the performance of the system. For this experiment, we selected 
three newsgroups (atheism, guns, and middle east) and generated six experimental setups 
based on all the possible ways of splitting these newsgroups into unique training/test pairs. 
In addition to the classifiers 1 and 2 above, the following classifiers were evaluated: 

3. A state-of-the-art multi-task classifier designed by Evgeniou and Pontil (2004). The
classifier learns a set of related classification functions $f_t(x) = w_t^T x + b_t$ for classification
tasks $t \in$ {training task, test task} given $m(t)$ data points $[x_{1t}, y_{1t}], \ldots, [x_{m(t)t}, y_{m(t)t}]$
3. Newsgroup articles were preprocessed by removing words which could not be interpreted as nouns by
WordNet. This preprocessing ensured that only one part of WordNet domain theory was exercised and
resulted in virtually no reduction in classification accuracy.

4. SeDuMi software (Sturm, 1999) was used to solve the iterative SOCP programs. 
for each task t by minimizing the regularized empirical risk:

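$$\min \; \sum_{t} \sum_{i=1}^{m(t)} \xi_{it} + C_1 \sum_{t} \|w_t - w_0\|^2 + C_2 \|w_0\|^2 \qquad (5.3)$$

$$\text{s.t. } y_{it}(w_t^T x_{it} + b_t) \ge 1 - \xi_{it}, \quad i = 1, \ldots, m(t), \; \forall t \qquad (5.4)$$
$$\xi_{it} \ge 0, \quad i = 1, \ldots, m(t), \; \forall t \qquad (5.5)$$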

The regularization constraint captures a tradeoff between final models wt being close 
to the average model wq and having a large margin on the training data. 90% of the 
training task data was made available to the classifier. Constant $C_1 := 1$ was chosen,
and C2 := 1000 was selected from the set {.1, .5, 1,2, 10, 1000, 10^, 10^°} to optimize 
the classifier's performance on Experiment 1 (for the training task: atheism vs. guns, 
test task: guns vs. mideast, see Figure 5.2) after observing .05% of the test task data
(in addition to the training task data). 

4. The LDA classifier described in Section 2 trained on 90% of the test task data. Since 
this classifier is the same as the bottom-level generative classifier used in the bilevel 
algorithm, its performance gives an upper bound on the performance of the bottom- 
level classifier trained in a generative fashion. 

Figure 5.2 shows performance of classifiers 1-3 as a function of the size of the training 
data from the test task (evaluation was done on the remaining test-task data). The results
are averaged over one hundred randomly selected datasets. The performance of the bilevel
classifier improves with increasing training data both because the discriminative portion of 
the classifier aims to minimize the training error and because the generative prior is imposed 
with soft constraints. As expected, the performance curves of the classifiers converge as the 
amount of available training data increases. Even though the constants used in the math- 
ematical program were selected in a single experimental setup, the classifier's performance 
is reasonable for a wide range of data sets across different experimental setups, with the 
possible exception of Experiment 4 (training task: guns vs. mideast, testing task: atheism 
vs. mideast), where the means of the constructed elliptical priors are much closer to each 
other than in the other experiments. Thus, the prior is imposed with greater confidence 
than is warranted, adversely affecting the classifier's performance. 

The multi-task classifier 3 outperforms the vanilla SVM by generalizing from data points 
across classification tasks. However, it does not take advantage of prior knowledge, while our 
classifier does. The gain in performance of the bilevel generative/discriminative classifier 
is due to the fact that the relationship between the classification tasks is captured much 
better by WordNet than by simple linear averaging of weight vectors. 

Because of the constants involved in both the bilevel classifier and the generative classi- 
fiers with Bayesian priors, it is hard to do a fair comparison between classifiers constrained 
by generative priors in these two frameworks. Instead, the generatively trained classifier 4 
gives an empirical upper bound on the performance achievable by the bottom-level classifier 
trained generatively on the test task data. The accuracy of this classifier is shown as
a horizontal line in the plots in Figure 5.2. Since discriminative classification is known to be
superior to generative classification for this problem, the SVM classifier outperforms the 
Figure 5.2: Test set accuracy as a percentage versus number of test task training points for
two classifiers (SVM and Bilevel Gen/Discr) tested on six different classification
tasks. For each classification experiment, the data set was split randomly into
training and test sets in 100 different ways. The error bars are based on 95%
confidence intervals.



generative classifier given enough data in four out of six experimental setups. What is more 
interesting, is that, for a range of training sample sizes, the bilevel classifier constrained 
by the generative prior outperforms both the SVM trained on the same sample and the 
generative classifier trained on a much larger sample in these four setups. This means that, 
unless prior knowledge outweighs the effect of learning, it cannot enable the LDA classifier 
to compete with our bilevel classifier on those problems. 

Finally, a set of experiments was performed to determine the effect of varying the
mathematical program parameters $\beta$ and $\varphi$ on the generalization error. Each parameter was
varied over a set of values, with the rest of the parameters held fixed ($\beta$ was increased up
to its maximum feasible value). The evaluation was done in the setup of Experiment 1 (for
Figure 5.3: Plots of test set accuracy as percentage versus mathematical program parameter
values. For each classification task, a random training set of size 9 was chosen
from the full set of test task articles in 100 different ways. Error bars are based
on 95% confidence intervals. All the experiments were performed on the training
task: atheism vs. guns, test task: guns vs. mideast.



the training task: atheism vs. guns, test task: guns vs. mideast), with the training set size
of 9 points. The results are presented in Figure 5.3. Increasing the value of $\beta$ is equivalent
to requiring a hyperplane separator to have smaller error given the prior. Decreasing the
value of $\varphi$ is equivalent to increasing the confidence in the hyperprior. Both of these actions
tighten the constraints (i.e., decrease the feasible region). With good prior knowledge, this 
should have the effect of improving generalization performance for small training samples 
since the prior is imposed with higher confidence. This is precisely what we observe in the 
plots of Figure 5.3. 

6. Generalization Performance 

Why does the algorithm generalize well for low sample sizes? In this section, we derive a
theorem which demonstrates that the convergence rate of the generalization error of the 
constrained generative-discriminative classifier depends on the parameters of the mathe- 
matical program and not just the margin, as would be expected in the case of large-margin 
classification without the prior. In particular, we show that as the certainty of the genera- 
tive prior knowledge increases, the upper bound on the generalization error of the classifier 
constrained by the prior decreases. By increasing certainty of the prior, we mean that 
either the hyper-prior becomes more peaked (i.e., the confidence in the locations of the 
prior means increases) or the desired upper bounds on the Type I and Type II probabilities 
of error of the classifier decrease (i.e., the requirement that the lower-level discriminative 
player choose the restricted Bayes-optimal hyperplane is more strictly enforced). 

The argument proceeds by bounding the fat-shattering dimension of the classifier con- 
strained by prior knowledge. The fat-shattering dimension of a large margin classifier is 
given by the following definition (Taylor & Bartlett, 1998):

Definition 6.1. A set of points $S = \{x^1, \ldots, x^m\}$ is $\gamma$-shattered by a set of functions F
mapping from a domain X to $\mathbb{R}$ if there are real numbers $r^1, \ldots, r^m$ such that, for each
$b \in \{-1, 1\}^m$, there is a function $f_b$ in F with $b_i(f_b(x^i) - r^i) \ge \gamma$, $i = 1..m$. We say
that $r^1, \ldots, r^m$ witness the shattering. Then the fat-shattering dimension of F is a function
$fat_F(\gamma)$ that maps $\gamma$ to the cardinality of the largest $\gamma$-shattered set S.



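Specifically, we consider the class of functions

$$F = \left\{ x \mapsto w^T x \;:\; \|x\| \le R,\; \|w\| = 1,\; \frac{w^T(-\mu_1)}{\|\Sigma_1^{1/2} w\|} \ge \beta,\; \|\Omega_1^{-1/2}(\mu_1 - t_1)\| \le \varphi,\; \frac{w^T \mu_2}{\|\Sigma_2^{1/2} w\|} \ge \beta,\; \|\Omega_2^{-1/2}(\mu_2 - t_2)\| \le \varphi \right\}. \qquad (6.1)$$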



The following theorem bounds the fat-shattering dimension of our classifier: 

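Theorem 6.2. Let F be the class of a-priori constrained functions defined by (6.1), and
let $\lambda_{\min}(P)$ and $\lambda_{\max}(P)$ denote the minimum and maximum eigenvalues of matrix P,
respectively. If a set of points S is $\gamma$-shattered by F, then $|S| \le \frac{4R^2(\alpha^2(1-\alpha^2))}{\gamma^2}$, where
$\alpha = \max(\alpha_1, \alpha_2)$ with $\alpha_1 = \min([\ldots])$ and $\alpha_2 = \min([\ldots])$, whose arguments are
expressions in $\|t_i\|$, $\lambda_{\max}(\Omega_i)\varphi$, $\lambda_{\min}(\Sigma_i)\beta$, and $\|\mu_i\|$ that are not legible in the source,
assuming that $\beta \ge 0$, $\|t_i\| \ge \|t_i - \mu_i\|$, and $\alpha_i \ge [\ldots]$, $i = 1, 2$.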



Proof. See Appendix B. □



We have the following corollary, which follows directly from Taylor and Bartlett's (1998)
Theorem 1.5 and bounds the classifier's generalization error based on its fat-shattering
dimension:

Corollary 6.3. Let G be a class of real-valued functions. Then, with probability at least
$1 - \delta$ over m independently generated examples z, if a classifier $h = \mathrm{sgn}(g) \in \mathrm{sgn}(G)$ has
margin at least $\gamma$ on all the examples in z, then the error of h is no more than
$\frac{2}{m}\left(d_G \log\left(\frac{8em}{d_G}\right)\log(32m) + \log\left(\frac{8m}{\delta}\right)\right)$, where $d_G = fat_G(\gamma/16)$.
If $G = F'$ is the usual class of large margin
classifiers (without the prior), then the result in (Taylor & Bartlett, 1998) shows that
$d_{F'} \le \frac{265R^2}{\gamma^2}$. If $G = F$ is the class of functions
defined by (6.1), then $d_F \le \frac{265R^2(4(\alpha^2(1-\alpha^2)))}{\gamma^2}$.



Notice that both bounds depend on $\frac{R^2}{\gamma^2}$. However, the bound of the classifier constrained
by the generative prior also depends on $\beta$ and $\varphi$ through the term $4(\alpha^2(1 - \alpha^2))$. In
particular, as $\beta$ increases, tightening the constraints, the bound decreases, ensuring, as expected,
quicker convergence of the generalization error. Similarly, decreasing $\varphi$ also tightens the
constraints and decreases the upper bound on the generalization error. For $\alpha > \frac{1}{\sqrt{2}}$, the
factor $4(\alpha^2(1 - \alpha^2))$ is less than 1 and the upper bound on the fat-shattering dimension $d_F$
is tighter than the usual bound in the no-prior case on $d_{F'}$.
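
The behavior of the multiplicative factor can be verified numerically. The short Python sketch below (the function name `prior_factor` is illustrative) evaluates $4\alpha^2(1-\alpha^2)$, which equals 1 at $\alpha = 1/\sqrt{2}$ and decreases toward 0 as $\alpha$ approaches 1.

```python
import math


def prior_factor(alpha):
    """Factor 4 * alpha^2 * (1 - alpha^2) multiplying the no-prior bound."""
    return 4.0 * alpha ** 2 * (1.0 - alpha ** 2)


threshold = 1.0 / math.sqrt(2.0)
# Sample alpha on a grid from 1/sqrt(2) up to 1 and record the factor.
grid = [threshold + k * (1.0 - threshold) / 10 for k in range(11)]
factors = [prior_factor(a) for a in grid]
```

On this grid the factor decreases monotonically, which matches the sign of the derivative $8\alpha(1 - 2\alpha^2)$: it is negative whenever $\alpha > 1/\sqrt{2}$.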

Since $\beta$ controls the amount of deviation of the decision boundary from the Bayes-optimal
hyperplane and $\varphi$ depends on the variance of the hyper-prior distribution, tightening
of these constraints corresponds to increasing our confidence in the prior. Note that a high
value of $\beta$ represents a high level of user confidence in the generative elliptical model. Also
note that there are two ways of increasing the tightness of the hyperprior constraint (4.7):
one is through the user-defined parameter $\varphi$, the other is through the automatically
estimated covariance matrices $\Omega_i$, $i = 1, 2$. These matrices estimate the extent to which the
equivalence classes defined by WordNet create an appropriate decomposition of the domain
theory for the newsgroup categorization task. Thus, a tight constraint (4.7) represents both
a high level of user confidence in the means of the generative classification model (estimated
from WordNet) and a good correspondence between the partition of the words imposed
by the semantic distance of WordNet and the elliptical generative model of the data. As
$\varphi$ approaches zero and $\beta$ approaches its highest feasible value, the solution of the bilevel
mathematical program reduces to the restricted Bayes optimal decision boundary computed
solely from the generative prior distributions, without using the data.

Hence, we have shown that, as the prior is imposed with increasing level of confidence 
(which means that the elliptical generative model is deemed good, or the estimates of 
its means are good, which in turn implies that the domain theory is well-suited for the 
classification task at hand) , the convergence rate of the generalization error of the classifier 
increases. Intuitively, this is precisely the desired effect of increased confidence in the prior 
since the benefit derived from the training data is outweighed by the benefit derived from 
prior knowledge. For low data samples, this should result in improved accuracy assuming 
the domain theory is good, which is what the plots in Figure 5.3 show. 

7. Related Work 

There are a number of approaches to combining generative and discriminative models. Sev- 
eral of these focus on deriving discriminative classifiers from generative distributions (Tong 
& Koller, 2000a; Tipping, 2001) or on learning the parameters of generative classifiers via 
discriminative training methods (Greiner & Zhou, 2002; Roos, Wettig, Grünwald,
Myllymäki, & Tirri, 2005). The closest in spirit to our approach is the Maximum Entropy
Discrimination framework (Jebara, 2004; Jaakkola, Meila, & Jebara, 1999), which performs 
discriminative estimation of parameters of a generative model, taking into account the con- 
straints of fitting the data and respecting the prior. One important difference with our 
framework is that, in estimating these parameters, maximum entropy discrimination
minimizes the distance between the generative model and the prior, subject to satisfying the
discriminative constraint that the training data be classified correctly with a given margin. 
Our framework, on the other hand, maximizes the margin on the training data subject to 
the constraint that the generative model is not too far from the prior. This emphasis on 
maximizing the margin allows us to derive a-priori bounds on the generalization error of 
our classifier based on the confidence in the prior which are not (yet) available for the max- 
imum entropy framework. Another difference is that our approach performs classification 
via a single generative model, while maximum entropy discrimination averages over a set of 
generative models weighted by their probabilities. This is similar to the distinction between 
maximum-a-posteriori and Bayesian estimation and has repercussions for tractability.
Maximum entropy discrimination, however, is more general than our framework in the sense of
allowing a richer set of behaviors based on different priors. 

Ng et al. (2003, 2001) explore the relative advantages of discriminative and generative 
classification and propose a hybrid approach which improves classification accuracy for 
both low-sample and high-sample scenarios. Collins (2002) proposes to use the Viterbi 
algorithm for HMMs for inferencing (which is based on generative assumptions), combined 
with a discriminative learning algorithm for HMM parameter estimation. These research 
directions are orthogonal to our work since they do not explicitly consider the question of
integration of prior knowledge into the learning problem.

In the context of support vector classification, various forms of prior knowledge have 
been explored. Schölkopf et al. (2002) demonstrate how to integrate prior knowledge about
invariance under transformations and importance of local structure into the kernel function.
Fung et al. (2002) use domain knowledge in the form of labeled polyhedral sets to augment
the training data. Wu and Srihari (2004) allow domain experts to specify their confidence 
in the example's label, varying the effect of each example on the separating hyperplane 
proportionately to its confidence. Epshteyn and DeJong (2005) explore the effects of ro- 
tational constraints on the normal of the separating hyperplane. Sun and DeJong (2005) 
propose an algorithm which uses domain knowledge (such as WordNet) to identify relevant
features of examples and incorporate resulting information in form of soft constraints on 
the hypothesis space of SVM classifier. Mangasarian et al. (2004) suggest the use of prior 
knowledge for support vector regression. In all of these approaches, prior knowledge takes 
the form of explicit constraints on the hypothesis space of the large-margin classifier. In this 
work, the emphasis is on generating such constraints automatically from domain knowledge 
interpreted in the generative setting. As we demonstrate with our WordNet application, 
generative interpretation of background knowledge is very intuitive for natural language 
processing problems. 

Second-order cone constraints have been applied extensively to model probability con- 
straints in robust convex optimization (Lobo et al., 1998; Bhattacharyya, Pannagadatta, & 
Smola, 2004) and constraints on the distribution of the data in minimax machines (Lanckriet 
et al., 2001; Huang, King, Lyu, & Chan, 2004). Our work, as far as we know, is the first one
which models prior knowledge with such constraints. The resulting optimization problem 
and its connection with Bayes optimal classification is very different from the approaches 
mentioned above. 

Our work is also related to empirical Bayes estimation (Carlin & Louis, 2000). In em- 
pirical Bayes estimation, the hyper-prior parameters of the generative model are estimated 
using statistical estimation methods (usually maximum likelihood or method of moments) 
through the marginal distribution of the data, while our approach learns those parameters 
discriminatively using the training data. 

8. Conclusions and Future Work

Since many sources of domain knowledge (such as WordNet) are readily available, we believe 
that significant benefit can be achieved by developing algorithms for automatically applying 
their information to new classification problems. In this paper, we argued that the gener- 
ative paradigm for interpreting background knowledge is preferable to the discriminative 
interpretation, and presented a novel algorithm which enables discriminative classifiers to 
utilize generative prior knowledge. Our algorithm was evaluated in the context of a com- 
plete system which, faced with the newsgroup classification task, was able to estimate the 
parameters needed to construct the generative prior from the domain theory, and use this 
construction to achieve improved performance on new newsgroup classification tasks. 

In this work, we restricted our hypothesis class to that of linear classifiers. Extending 
the form of the prior distribution to distributions other than elliptical and/or looking for 
Bayes-optimal classifiers restricted to a more expressive class than that of linear separators
may result in improvement in classification accuracy for non linearly-separable domains. 
However, it is not obvious how to approximate this more expressive form of prior knowledge 
with convex constraints. The kernel trick may be helpful in handling nonlinear problems, 
assuming that it is possible to represent the optimization problem exclusively in terms of 
dot products of the data points and constraints. This is an important issue which requires 
further study. 

We have demonstrated that interpreting domain theory in the generative setting is 
intuitive and produces good empirical results. However, there are usually multiple ways 
of interpreting a domain theory. In WordNet, for instance, semantic distance between 
words is only one measure of information contained in the domain theory. Other, more 
complicated, interpretations might, for example, take into account types of links on the
path between the words (hypernyms, synonyms, meronyms, etc.) and exploit common-sense
observations about WordNet such as words that are closer to the category label
are more likely to be informative than words farther away. Comparing multiple ways of 
constructing the generative prior from the domain theory and, ultimately, selecting one of 
these interpretations automatically is a fruitful direction for further research. 
Appendix A. Convergence of the Generative/Discriminative Algorithm

Let the map $H : Z \to Z$ determine an algorithm that, given a point $\mu^{(0)}$, generates a
sequence of iterates $\mu^{(t)}$ through the iteration $\mu^{(t+1)} = H(\mu^{(t)})$. The iterative algorithm
in Section 4 generates a sequence of iterates $\mu^{(t)} = [\mu_1^{(t)}, \mu_2^{(t)}] \in Z$ by applying the following
map H:

$$H = H_2 \circ H_1: \qquad (A.1)$$

In step 1,

$$H_1([\mu_1, \mu_2]) = \arg\min_{[w,b] \in U([\mu_1, \mu_2])} \|w\| \qquad (A.2)$$

with the set $U([\mu_1, \mu_2])$ defined by constraints:

$$y_i(w^T x_i + b) - 1 \ge 0, \quad i = 1, \ldots, m \qquad (A.4)$$
$$c_{-1}(w, b; \mu_1, \Sigma_1) - \beta \ge 0 \qquad (A.5)$$
$$c_1(w, b; \mu_2, \Sigma_2) - \beta \ge 0 \qquad (A.6)$$

In step 2,

$$H_2(w, b) = \arg\min_{(\mu_1, \mu_2) \in V} -\left(c_{-1}(w, b; \mu_1, \Sigma_1) + c_1(w, b; \mu_2, \Sigma_2)\right) \qquad (A.7)$$

with the set V given by the constraints

$$\varphi - o(\mu_1; \Omega_1, t_1) \ge 0 \qquad (A.8)$$
$$\varphi - o(\mu_2; \Omega_2, t_2) \ge 0 \qquad (A.9)$$

with $o(\mu; \Omega, t) \triangleq \|\Omega^{-1/2}(\mu - t)\|$.

Notice that Hi and H2 are functions because the minima for optimization problems 
(4.10)-(4.14) and (4.15)-(4.16) are unique. This is the case because Step 1 optimizes a 
strictly convex function on a convex set, and Step 2 optimizes a linear non-constant function 
on a strictly convex set. 
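
The structure of the iteration $\mu^{(t+1)} = H_2(H_1(\mu^{(t)}))$ can be illustrated on a toy problem. The Python sketch below is not the SOCP-based algorithm of Section 4: it assumes a simple quadratic $f(x, y) = (x - y)^2 + (x - 1)^2 + (y + 1)^2$ chosen only for illustration, in which $x$ plays the role of $[w, b]$ and $y$ plays the role of $\mu$, and both partial minimizers are available in closed form. All names (`iterate_composed_map`, `h1`, `h2`, `psi`) are hypothetical.

```python
def iterate_composed_map(h1, h2, psi, mu0, tol=1e-12, max_iter=1000):
    """Apply H = h2 o h1 until the objective psi decreases by at most tol."""
    mu, value = mu0, psi(mu0)
    for _ in range(max_iter):
        mu_next = h2(h1(mu))
        value_next = psi(mu_next)
        if value - value_next <= tol:  # objective has (numerically) stopped changing
            return mu_next
        mu, value = mu_next, value_next
    return mu


def f(x, y):
    """Toy objective standing in for the bilevel program (an assumption)."""
    return (x - y) ** 2 + (x - 1.0) ** 2 + (y + 1.0) ** 2


def h1(y):
    """Step 1: unique minimizer of f over x for fixed y."""
    return (y + 1.0) / 2.0


def h2(x):
    """Step 2: unique minimizer of f over y for fixed x."""
    return (x - 1.0) / 2.0


def psi(y):
    """Objective as a function of the iterate, psi(y) = min_x f(x, y)."""
    return f(h1(y), y)


y_star = iterate_composed_map(h1, h2, psi, mu0=5.0)
x_star = h1(y_star)
```

Because each step solves its subproblem exactly and uniquely, the objective never increases from one iterate to the next, which is the property that the convergence argument in this appendix relies on. In this toy problem the iterates approach the fixed point $x = 1/3$, $y = -1/3$.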

Convergence of the objective function $\psi(\mu^{(t)}) = \min_{[w,b] \in U([\mu_1^{(t)}, \mu_2^{(t)}])} \|w\|$ of the algorithm
was shown in Theorem 4.1. Let $\Gamma$ denote the set of points on which the map H does not
change the value of the objective function, i.e. $\mu^* \in \Gamma \Leftrightarrow \psi(H(\mu^*)) = \psi(\mu^*)$. We will
show that every accumulation point of $\{\mu^{(t)}\}$ lies in $\Gamma$. We will also show that every point
$[\mu_1^*, \mu_2^*] \in \Gamma$ augmented with $[w^*, b^*] = H_1([\mu_1^*, \mu_2^*])$ is a point with no feasible descent
directions for the optimization problem (4.5)-(4.9), which can be equivalently expressed as:

$$\min_{\mu_1, \mu_2, w, b} \|w\| \quad \text{s.t. } [\mu_1, \mu_2] \in V; \; [w, b] \in U([\mu_1, \mu_2]) \qquad (A.10)$$

In order to formally state our result, we need a few concepts from the duality theory.
Let a constrained optimization problem be given by

$$\min_x f(x) \quad \text{s.t. } c_i(x) \ge 0, \; i = 1, \ldots, k \qquad (A.11)$$

The following conditions, known as Karush-Kuhn-Tucker (KKT) conditions, are necessary
for $x^*$ to be a local minimum:

Proposition A.1. If $x^*$ is a local minimum of (A.11), then $\exists \lambda_1, \ldots, \lambda_k$ such that

1. $\nabla f(x^*) = \sum_{i=1}^{k} \lambda_i \nabla c_i(x^*)$

2. $\lambda_i \ge 0$ for $\forall i \in \{1, \ldots, k\}$

3. $c_i(x^*) \ge 0$ for $\forall i \in \{1, \ldots, k\}$

4. $\lambda_i c_i(x^*) = 0$ for $\forall i \in \{1, \ldots, k\}$

$\lambda_1, \ldots, \lambda_k$ are known as Lagrange multipliers of constraints $c_1, \ldots, c_k$.
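
As a concrete illustration of the four conditions, the Python sketch below checks them numerically at a candidate point, given the gradients, constraint values, and multipliers. The helper `kkt_satisfied` and the example problem (minimize $x^2 + y^2$ subject to $x + y - 1 \ge 0$) are assumptions made for illustration and do not appear in the original text.

```python
def kkt_satisfied(grad_f, grad_cs, c_vals, lams, tol=1e-9):
    """Check KKT conditions (1)-(4) of Proposition A.1 at a given point.

    grad_f  : gradient of the objective at the point
    grad_cs : list of constraint gradients at the point
    c_vals  : list of constraint values c_i at the point
    lams    : list of candidate Lagrange multipliers
    """
    # (1) stationarity: grad f = sum_i lambda_i * grad c_i
    for j in range(len(grad_f)):
        rhs = sum(lam * g[j] for lam, g in zip(lams, grad_cs))
        if abs(grad_f[j] - rhs) > tol:
            return False
    # (2) nonnegative multipliers, (3) feasibility, (4) complementary slackness
    return all(
        lam >= -tol and c >= -tol and abs(lam * c) <= tol
        for lam, c in zip(lams, c_vals)
    )


# Example: f(x, y) = x^2 + y^2, c(x, y) = x + y - 1 >= 0.
# At (0.5, 0.5): grad f = (1, 1), grad c = (1, 1), c = 0, multiplier 1.
at_minimum = kkt_satisfied([1.0, 1.0], [[1.0, 1.0]], [0.0], [1.0])
# At (1, 1): grad f = (2, 2), c = 1; stationarity needs multiplier 2,
# which violates complementary slackness because c > 0.
at_interior = kkt_satisfied([2.0, 2.0], [[1.0, 1.0]], [1.0], [2.0])
```

The first call returns `True` and the second returns `False`, which reflects that the constraint must be active whenever its multiplier is positive.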

The following well-known result states that KKT conditions are sufficient for $x^*$ to be
a point with no feasible descent directions:

Proposition A.2. If $\exists \lambda_1, \ldots, \lambda_k$ such that the following conditions are satisfied at $x^*$:

1. $\nabla f(x^*) = \sum_{i=1}^{k} \lambda_i \nabla c_i(x^*)$

2. $\lambda_i \ge 0$ for $\forall i \in \{1, \ldots, k\}$

then $x^*$ has no feasible descent directions in the problem (A.11).

Proof. (sketch) We reproduce the proof given in a textbook by Fletcher (1987). The
proposition is true because for any feasible direction vector s, $s^T \nabla c_i(x) \ge 0$ for $\forall x$ and for
$\forall i \in \{1, \ldots, k\}$. Hence, $s^T \nabla f(x^*) = \sum_{i=1}^{k} \lambda_i s^T \nabla c_i(x^*) \ge 0$, so s is not a descent direction. □

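The following lemma characterizes the points in the set $\Gamma$:

Lemma A.3. Let $\mu^* \in \Gamma$, let $[w^*, b^*] = H_1(\mu^*)$ be the optimizer of $\psi(\mu^*)$, and let
$\lambda^* = [\lambda^*_{(A.4),1}, \ldots, \lambda^*_{(A.4),m}, \lambda^*_{(A.5)}, \lambda^*_{(A.6)}]$ be a set of Lagrange multipliers corresponding to the
constraints for the solution $[w^*, b^*]$. Define $\mu' = H(\mu^*)$, and let $[w', b']$ be the optimizer of
$\psi(\mu')$. If $\mu'_2 \ne \mu_2^*$, then $\lambda^*_{(A.6)} = 0$ for some $\lambda^*$. If $\mu'_1 \ne \mu_1^*$, then $\lambda^*_{(A.5)} = 0$ for some $\lambda^*$.
If both $\mu'_1 \ne \mu_1^*$ and $\mu'_2 \ne \mu_2^*$, then $\lambda^*_{(A.5)} = \lambda^*_{(A.6)} = 0$ for some $\lambda^*$.

Proof. Consider the case when

$$\mu'_2 \ne \mu_2^* \qquad (A.12)$$

and

$$\mu'_1 = \mu_1^*. \qquad (A.13)$$

Since $\mu^* \in \Gamma$, $\|w'\| = \|w^*\|$. Let $\lambda'$ be a set of Lagrange multipliers corresponding to the
constraints for the solution $[w', b']$. Since $w^*$ is still feasible for the optimization problem
given by $\psi(\mu')$ (by the argument in Theorem 4.1) and the minimum of this problem is
unique, this can only happen if

$$[w', b'] = [w^*, b^*]. \qquad (A.14)$$

Then $[w^*, b^*]$ and $\lambda'$ must satisfy KKT conditions for $\psi(\mu')$. (A.12) implies that
$c_1(w^*; \mu'_2, \Sigma_2) > c_1(w^*; \mu_2^*, \Sigma_2) \ge \beta$ by the same argument as in Theorem 4.1, which means
that, by KKT condition (4) for $\psi(\mu')$,

$$\lambda'_{(A.6)} = 0. \qquad (A.15)$$

Therefore, by KKT condition (1) for $\psi(\mu')$ and (A.15), at $[w, b, \mu_1, \mu_2] = [w^* = w', b^* = b', \mu_1^* = \mu'_1, \mu_2^*]$

$$\begin{pmatrix} \frac{\partial \|w\|}{\partial w} \\ \frac{\partial \|w\|}{\partial b} \end{pmatrix} = \sum_{i=1}^{m} \lambda'_{(A.4),i} \begin{pmatrix} y_i x_i \\ y_i \end{pmatrix} + \lambda'_{(A.5)} \begin{pmatrix} \frac{\partial c_{-1}(w, b^*; \mu_1^*, \Sigma_1)}{\partial w} \\ \frac{\partial c_{-1}(w^*, b; \mu_1^*, \Sigma_1)}{\partial b} \end{pmatrix} + \lambda'_{(A.6)} \begin{pmatrix} \frac{\partial c_1(w, b^*; \mu_2^*, \Sigma_2)}{\partial w} \\ \frac{\partial c_1(w^*, b; \mu_2^*, \Sigma_2)}{\partial b} \end{pmatrix}$$

which means that KKT conditions (1),(2) for the optimization problem $\psi(\mu^*)$ are satisfied
at the point $[w^*, b^*]$ with $\lambda^* = \lambda'$. KKT condition (3) is satisfied by feasibility of $[w^*, b^*]$
and KKT condition (4) is satisfied by the same condition for $\psi(\mu')$ and observations (A.13),
(A.14), and (A.15).

The proofs for the other two cases ($\mu'_2 = \mu_2^*, \mu'_1 \ne \mu_1^*$ and $\mu'_2 \ne \mu_2^*, \mu'_1 \ne \mu_1^*$) are
analogous. □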

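The following theorem states that the points in $\Gamma$ are KKT points (i.e., points at which
KKT conditions are satisfied) for the optimization problem given by (A.10).

Theorem A.4. If $\mu^* \in \Gamma$ and $[w^*, b^*] = H_1(\mu^*)$, then $[w^*, b^*, \mu_1^*, \mu_2^*]$ is a KKT point for
the optimization problem given by (A.10).

Proof. Let $\mu' = H(\mu^*)$. Just like in Lemma A.3, we only consider the case

$$\mu'_2 \ne \mu_2^*, \qquad (A.16)$$
$$\mu'_1 = \mu_1^* \;\Rightarrow\; \lambda^*_{(A.6)} = 0 \text{ (by Lemma A.3)} \qquad (A.17)$$

(the proofs for the other two cases are similar).
By KKT conditions for $H_2(w^*, b^*)$, at $\mu_1 = \mu'_1$

$$\frac{\partial c_{-1}(w^*, b^*; \mu_1, \Sigma_1)}{\partial \mu_1} + \lambda_{(A.8)} \frac{\partial(-o(\mu_1; \Omega_1, t_1))}{\partial \mu_1} = 0 \text{ for some } \lambda_{(A.8)} \ge 0. \qquad (A.18)$$

By KKT conditions for $H_1(\mu^*)$ and (A.17), at $[w, b] = [w^*, b^*]$

$$\begin{pmatrix} \frac{\partial \|w\|}{\partial w} \\ \frac{\partial \|w\|}{\partial b} \end{pmatrix} = \sum_{i=1}^{m} \lambda_{(A.4),i} \begin{pmatrix} y_i x_i \\ y_i \end{pmatrix} + \lambda_{(A.5)} \begin{pmatrix} \frac{\partial c_{-1}(w, b^*; \mu_1^*, \Sigma_1)}{\partial w} \\ \frac{\partial c_{-1}(w^*, b; \mu_1^*, \Sigma_1)}{\partial b} \end{pmatrix} \text{ for some } [\lambda_{(A.4),1}, \ldots, \lambda_{(A.4),m}, \lambda_{(A.5)}] \succeq 0. \qquad (A.19)$$

By (A.16), (A.17), (A.18), and (A.19), at $[w, b, \mu_1, \mu_2] = [w^*, b^*, \mu_1^* = \mu'_1, \mu_2^*]$

$$\begin{pmatrix} \frac{\partial \|w\|}{\partial w} \\ \frac{\partial \|w\|}{\partial b} \\ \frac{\partial \|w\|}{\partial \mu_1} \\ \frac{\partial \|w\|}{\partial \mu_2} \end{pmatrix} = \sum_{i=1}^{m} \lambda_{(A.4),i} \begin{pmatrix} y_i x_i \\ y_i \\ 0 \\ 0 \end{pmatrix} + \lambda_{(A.5)} \begin{pmatrix} \frac{\partial c_{-1}(w, b^*; \mu_1^*, \Sigma_1)}{\partial w} \\ \frac{\partial c_{-1}(w^*, b; \mu_1^*, \Sigma_1)}{\partial b} \\ \frac{\partial c_{-1}(w^*, b^*; \mu_1, \Sigma_1)}{\partial \mu_1} \\ 0 \end{pmatrix} + \lambda_{(A.8)} \lambda_{(A.5)} \begin{pmatrix} 0 \\ 0 \\ \frac{\partial(-o(\mu_1; \Omega_1, t_1))}{\partial \mu_1} \\ 0 \end{pmatrix}$$
$$\qquad + \lambda_{(A.6)} \begin{pmatrix} \frac{\partial c_1(w, b^*; \mu_2^*, \Sigma_2)}{\partial w} \\ \frac{\partial c_1(w^*, b; \mu_2^*, \Sigma_2)}{\partial b} \\ 0 \\ \frac{\partial c_1(w^*, b^*; \mu_2, \Sigma_2)}{\partial \mu_2} \end{pmatrix} + \lambda_{(A.6)} \begin{pmatrix} 0 \\ 0 \\ 0 \\ \frac{\partial(-o(\mu_2; \Omega_2, t_2))}{\partial \mu_2} \end{pmatrix}$$

which means that KKT conditions (1),(2) for the optimization problem (A.10) are satisfied
at the point $[w^*, b^*, \mu_1^*, \mu_2^*]$ with $\lambda' = [\lambda_{(A.4),1}, \ldots, \lambda_{(A.4),m}, \lambda_{(A.5)}, \lambda_{(A.6)}, \lambda_{(A.8)}\lambda_{(A.5)}, \lambda_{(A.6)}]$.
$\lambda'$ also satisfies KKT conditions (3), (4) by assumption (A.17) and the KKT conditions for
$H_1$ and $H_2$. □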



In order to prove convergence properties of the iterates $\mu^{(t)}$, we use the following theorem
due to Zangwill (1969):

Theorem A.5. Let the map $H : Z \to Z$ determine an iterative algorithm via $\mu^{(t+1)} =
H(\mu^{(t)})$, let $\psi(\mu)$ denote the objective function, and let $\Gamma$ be the set of points on which the
map H does not change the value of the objective function, i.e. $\mu \in \Gamma \Leftrightarrow \psi(H(\mu)) = \psi(\mu)$.
Suppose

1. H is uniformly compact on Z, i.e. there is a compact subset $Z_0 \subseteq Z$ such that
$H(\mu) \in Z_0$ for $\forall \mu \in Z$.

2. H is strictly monotonic on $Z - \Gamma$, i.e. $\psi(H(\mu)) < \psi(\mu)$.

3. H is closed on $Z - \Gamma$, i.e. if $w_i \to w$ and $H(w_i) \to \xi$, then $\xi = H(w)$.

Then the accumulation points of the sequence of $\mu^{(t)}$ lie in $\Gamma$.

The following proposition shows that minimization of a continuous function on a feasible 
set which is a continuous map of the function's argument forms a closed function. 

Proposition A.6. Given

1. a real-valued continuous function f on $A \times B$,

2. a point-to-set map $U : A \to 2^B$ continuous with respect to the Hausdorff metric:[5]
$dist(X, Y) = \max(d(X, Y), d(Y, X))$, where $d(X, Y) = \max_{x \in X} \min_{y \in Y} \|x - y\|$,

define the function $F : A \to B$ by

$$F(a) = \arg\min_{b' \in U(a)} f(a, b') = \{b : f(a, b) < f(a, b') \text{ for } \forall b' \in U(a)\},$$

assuming the minimum exists and is unique. Then, the function F is closed at a.

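Proof. This proof is a minor modification of the one given by Gunawardana and Byrne
(2005). Let $\{a^{(t)}\}$ be a sequence in A such that

$$a^{(t)} \to a, \quad F(a^{(t)}) \to b \qquad (A.20)$$

The function F is closed at a if $F(a) = b$. Suppose this is not the case, i.e. $b \ne F(a) =
\arg\min_{b' \in U(a)} f(a, b')$. Therefore,

$$\exists \hat{b} = \arg\min_{b' \in U(a)} f(a, b') \text{ such that } f(a, b) > f(a, \hat{b}) \qquad (A.21)$$

By continuity of $f(\cdot, \cdot)$ and (A.20),

$$f(a^{(t)}, F(a^{(t)})) \to f(a, b) \qquad (A.22)$$

By continuity of $U(\cdot)$ and (A.20),

$$dist(U(a^{(t)}), U(a)) \to 0 \;\Rightarrow\; \exists \hat{b}^{(t)} \to \hat{b} \text{ with } \hat{b}^{(t)} \in U(a^{(t)}), \text{ for } \forall t. \qquad (A.23)$$

(A.22), (A.23), and (A.21) imply that

$$\exists K \text{ such that } f(a^{(t)}, F(a^{(t)})) > f(a^{(t)}, \hat{b}^{(t)}), \text{ for } \forall t > K \qquad (A.24)$$

which is a contradiction since by assumption, $F(a^{(t)}) = \arg\min_{b' \in U(a^{(t)})} f(a^{(t)}, b')$ and by (A.24),
$\hat{b}^{(t)} \in U(a^{(t)})$. □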

5. A point-to-set map $U(a)$ maps a point a to a set of points. $U(a)$ is continuous with respect to a distance
metric dist iff $a^{(t)} \to a$ implies $dist(U(a^{(t)}), U(a)) \to 0$.
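
For intuition, the Hausdorff distance defined in Proposition A.6 can be computed directly when the sets are finite. The Python sketch below makes that simplifying assumption (the sets $U(a)$ in the proof are continuous regions, not finite point sets); the function names are illustrative.

```python
import math


def directed_distance(X, Y):
    """d(X, Y) = max over x in X of the distance from x to its nearest y in Y."""
    return max(min(math.dist(x, y) for y in Y) for x in X)


def hausdorff(X, Y):
    """dist(X, Y) = max(d(X, Y), d(Y, X)) for finite point sets."""
    return max(directed_distance(X, Y), directed_distance(Y, X))


X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 0.0), (4.0, 0.0)]
d_xy = directed_distance(X, Y)  # every point of X is within 1 of Y
d_yx = directed_distance(Y, X)  # the point (4, 0) is 3 away from X
d_h = hausdorff(X, Y)
```

The example shows why both directed distances are needed: $d(X, Y) = 1$ while $d(Y, X) = 3$, so the symmetric distance is 3.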
Proposition A.7. The function H defined by (A.1)-(A.7) is closed.

Proof. Let $\{\mu^{(t)}\}$ be a sequence such that $\mu^{(t)} \to \mu^*$. Since all the iterates $\mu^{(t)}$ lie in
the closed feasible region bounded by constraints (4.6)-(4.9) and the boundary of $U(\mu)$ is
piecewise linear in $\mu$, the boundary of $U(\mu^{(t)})$ converges uniformly to the boundary of $U(\mu^*)$
as $\mu^{(t)} \to \mu^*$, which implies that the Hausdorff distance between the boundaries converges
to zero. Since the Hausdorff distance between convex sets is equal to the Hausdorff distance
between their boundaries, $dist(U(\mu^{(t)}), U(\mu^*))$ also converges to zero. Hence, Proposition
A.6 implies that $H_1$ is closed. The same proposition implies that $H_2$ is closed. A composition
of closed functions is closed, hence H is closed. □

We now prove the main result of this Section: 

Theorem 4.2. Let $H$ be the function defined by (A.1)-(A.7) which determines the generative/discriminative algorithm via $\mu^{(t+1)} = H(\mu^{(t)})$. Then accumulation points $\mu^*$ of the sequence $\mu^{(t)}$, augmented with the corresponding $[w^*, b^*]$, have no feasible descent directions for the original optimization problem given by (4.5)-(4.9).

Proof. The proof is by verifying that $H$ satisfies the properties of Theorem A.5. Closedness of $H$ was shown in Proposition A.7. Strict monotonicity of $\psi(\mu^{(t)})$ was shown in Theorem 4.1. Since all the iterates $\mu^{(t)}$ are in the closed feasible region bounded by constraints (4.6)-(4.9), $H$ is uniformly compact on $Z$. Since all the accumulation points $\mu^*$ lie in $\Gamma$, they are KKT points of the original optimization problem by Theorem A.4, and, therefore, have no feasible descent directions by Proposition A.2. □
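
The role of the conditions in Theorem A.5 can be seen on a toy problem. The sketch below is not the generative/discriminative program (4.5)-(4.9); it alternates two exact minimizations of a hypothetical convex quadratic psi(x, y), composes them into a map H = H2 o H1, and records the objective along the iterates. The objective, the names psi, H1, H2, H and the starting point are all assumptions chosen for illustration.

```python
def psi(x, y):
    """Toy objective (assumption): a strictly convex quadratic, minimised at (1, 1)."""
    return x * x + y * y + x * y - 3.0 * x - 3.0 * y


def H1(y):
    """Exact minimiser of psi over x with y held fixed (solves 2x + y - 3 = 0)."""
    return (3.0 - y) / 2.0


def H2(x):
    """Exact minimiser of psi over y with x held fixed (solves 2y + x - 3 = 0)."""
    return (3.0 - x) / 2.0


def H(y):
    """One full iteration: update x from y, then update y from the new x."""
    return H2(H1(y))


y = 10.0
values = []
for _ in range(10):
    values.append(psi(H1(y), y))  # objective at the current iterate
    y = H(y)

print("objective values:", [round(v, 6) for v in values])
print("final iterate y:", y)
```

In this example the recorded objective values decrease at every step and the iterates approach the fixed point y = 1 of H, where no further descent is possible. This mirrors, in a much simpler setting, how monotonicity and closedness are combined in the convergence argument.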

Appendix B. Generalization of the Generative/Discriminative Classifier 

We need a few auxiliary results before proving Theorem 6.2. The first proposition bounds the angle of rotation between two vectors $w_1, w_2$ and the distance between them if the angle of rotation between each of these vectors and some reference vector $v$ is sufficiently small:

Proposition B.1. Let $\|w_1\| = \|w_2\| = \|v\| = 1$. If $w_1^T v \ge \alpha > 0$ and $w_2^T v \ge \alpha > 0$, then

1. $w_1^T w_2 \ge 2\alpha^2 - 1$

2. $\|w_1 - w_2\| \le 2\sqrt{1 - \alpha^2}$

Proof.

1. By the triangle inequality, $\arccos(w_1^T w_2) \le \arccos(w_1^T v) + \arccos(w_2^T v) \le 2\arccos(\alpha)$ (since the angle between two vectors is a distance measure). Taking cosines of both sides and using trigonometric equalities yields $w_1^T w_2 \ge 2\alpha^2 - 1$.

2. Expand $\|w_1 - w_2\|^2 = \|w_1\|^2 + \|w_2\|^2 - 2w_1^T w_2 = 2(1 - w_1^T w_2)$. Since $w_1^T w_2 \ge 2\alpha^2 - 1$ from part 1, $\|w_1 - w_2\|^2 \le 4(1 - \alpha^2)$.

□ 
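
Both parts of Proposition B.1 can be sanity-checked numerically. The sketch below draws random unit vectors near a reference vector, takes alpha to be the smaller of the two cosines, and records the smallest slack observed in each inequality. The sampling scheme, dimension, noise level and function names are assumptions of this example, and a random check of this kind is only supporting evidence, not a proof.

```python
import math
import random


def unit(v):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def check_b1(trials=2000, dim=5, noise=0.7, seed=0):
    """Return (cases tested, smallest slack in part 1, smallest slack in part 2)."""
    rng = random.Random(seed)
    tested, slack1, slack2 = 0, math.inf, math.inf
    for _ in range(trials):
        v = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
        w1 = unit([c + noise * rng.gauss(0.0, 1.0) for c in v])
        w2 = unit([c + noise * rng.gauss(0.0, 1.0) for c in v])
        alpha = min(dot(w1, v), dot(w2, v))
        if alpha <= 0.0:
            continue  # the proposition only covers alpha > 0
        tested += 1
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))
        # Part 1: w1.w2 >= 2 alpha^2 - 1
        slack1 = min(slack1, dot(w1, w2) - (2.0 * alpha ** 2 - 1.0))
        # Part 2: ||w1 - w2|| <= 2 sqrt(1 - alpha^2)
        slack2 = min(slack2, 2.0 * math.sqrt(max(0.0, 1.0 - alpha ** 2)) - dist)
    return tested, slack1, slack2


tested, slack1, slack2 = check_b1()
print(tested, slack1, slack2)
```

A non-negative slack (up to floating-point rounding) in both columns means that neither inequality was violated on the sampled vectors.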

The next proposition bounds the angle of rotation between two vectors $t$ and $\mu$ if they are not too far away from each other as measured by the L2-norm distance:






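Proposition B.2. Let $\|t\| = \nu$, $\|\mu - t\| \le \tau$. Then $\frac{t^T\mu}{\|t\|\|\mu\|} \ge \frac{\nu^2 - \tau^2}{\nu(\tau + \nu)}$.

Proof. Expanding $\|\mu - t\|^2 = \|\mu\|^2 + \|t\|^2 - 2t^T\mu$ and using $\|\mu - t\|^2 \le \tau^2$, we get $\frac{t^T\mu}{\|t\|\|\mu\|} \ge \frac{\|\mu\|}{2\|t\|} + \frac{\|t\|}{2\|\mu\|} - \frac{\tau^2}{2\|t\|\|\mu\|}$. Use the triangle inequality $\nu - \tau \le \|t\| - \|\mu - t\| \le \|\mu\| \le \|t\| + \|\mu - t\| \le \nu + \tau$ and simplify. □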
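
The two steps of this proof, the expansion of the squared distance and the triangle inequality, can be checked numerically. The sketch below samples a vector t and a vector mu within distance tau of it, with tau < ||t|| imposed as an assumption of the example. It verifies the expansion-based lower bound on the cosine and the triangle-inequality bounds on ||mu||. It also checks one closed form that these two steps yield when tau < nu, namely (nu - tau) / nu; this particular form is our own simplification, used for illustration. Function names and sampling choices are hypothetical.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def check_b2(trials=2000, dim=4, seed=1):
    """Return the smallest slack seen in the expansion bound and in the closed form."""
    rng = random.Random(seed)
    slack_expansion, slack_closed_form = math.inf, math.inf
    for _ in range(trials):
        t = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        nu = norm(t)
        tau = rng.uniform(0.0, 0.9) * nu          # assumption: tau < nu
        step = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = tau * rng.random() / norm(step)   # ensures ||mu - t|| <= tau
        mu = [a + scale * b for a, b in zip(t, step)]
        m = norm(mu)
        cosine = dot(t, mu) / (nu * m)
        # Step 1: expansion of ||mu - t||^2 <= tau^2.
        expansion = (nu ** 2 + m ** 2 - tau ** 2) / (2.0 * nu * m)
        # Step 2: triangle inequality nu - tau <= ||mu|| <= nu + tau.
        assert nu - tau - 1e-12 <= m <= nu + tau + 1e-12
        slack_expansion = min(slack_expansion, cosine - expansion)
        slack_closed_form = min(slack_closed_form, cosine - (nu - tau) / nu)
    return slack_expansion, slack_closed_form


slack_expansion, slack_closed_form = check_b2()
print(slack_expansion, slack_closed_form)
```

Non-negative slacks (up to rounding) indicate that the cosine of the angle between t and mu stayed above both lower bounds on every sampled pair.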

The following proposition will be used to bound the angle of rotation between the normal 
w of the separating hyperplane and the mean vector t of the hyper-prior distribution: 

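Proposition B.3. Let $\frac{w^T\mu}{\|w\|\|\mu\|} \ge \beta > 0$ and $\|\mu - t\| \le \varphi < \|t\|$. Then $\frac{w^T t}{\|w\|\|t\|} \ge (2\alpha^2 - 1)$, where $\alpha = \min\left(\beta, \frac{\|t\|^2 - \varphi^2}{\|t\|(\varphi + \|t\|)}\right)$.

Proof. Follows directly from Propositions B.1 (part 1) and B.2. □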

We now prove Theorem 6.2, which relies on parts of the well-known proof of the fat- 
shattering dimension bound for large margin classifiers derived by Taylor and Bartlett 
(1998). 

Theorem 6.2. Let $F$ be the class of a-priori constrained functions defined by 6.1, and let $\lambda_{min}(P)$ and $\lambda_{max}(P)$ denote the minimum and maximum eigenvalues of matrix $P$, respectively. If a set of points $S$ is $\gamma$-shattered by $F$, then $|S| \le \frac{4R^2(\alpha^2(1-\alpha^2))}{\gamma^2}$, where

a - maximal, a2j wiin ai - mm(^ ||^^|| , ||t2||(A^„^(n2)¥')^+||t2||) - ^^^\ > 
llJliawZlS^SlI) )^ assumm^ that P>0, \\ti\\ > \\ti - ml and ai>^,i = 1,2. 

Proof. First, we use the inequality $\lambda_{min}(P)\|w\| \le \|Pw\| \le \lambda_{max}(P)\|w\|$ to relax the constraints

""^^ > /3 ^ ^ > A_(E.)/3 (B.l) 



^2 W 



^2 ^^^(Ai2 - i2) < (P ^ 11^2 - t2\\ < JT—i7 = 'P>'max{^2)- (B.2) 



0-'/'(/Xi - ti) 



< (f are relaxed 



T 

The constraints imposed by the second prior i r^-^^^i > /3, 
in a similar fashion to produce: 

""^IJiP^ > Xmini^l)^ (B.3) 
IIMI - *i|| < ^Xmaxi^l) (B.4) 

Now, we show that if the assumptions made in the statement of the theorem hold, then every subset $S_0 \subseteq S$ satisfies $\|\sum S_0 - \sum(S - S_0)\| \ge \frac{|S|\gamma}{2\sqrt{\alpha^2(1-\alpha^2)}}$.

Assume that $S$ is $\gamma$-shattered by $F$. The argument used by Taylor and Bartlett (1998) in Lemma 1.2 shows that, by the definition of fat-shattering, there exists a vector $w_1$ such that

$$w_1^T\left(\sum S_0 - \sum(S - S_0)\right) \ge |S|\gamma. \qquad \text{(B.5)}$$
Similarly (reversing the labeling of $S_0$ and $S - S_0$), there exists a vector $w_2$ such that

$$w_2^T\left(\sum(S - S_0) - \sum S_0\right) \ge |S|\gamma. \qquad \text{(B.6)}$$

Hence, $(w_1 - w_2)^T\left(\sum S_0 - \sum(S - S_0)\right) \ge 2|S|\gamma$, which, by the Cauchy-Schwarz inequality, implies that

$$\|w_1 - w_2\| \ge \frac{2|S|\gamma}{\|\sum S_0 - \sum(S - S_0)\|} \qquad \text{(B.7)}$$

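The constraints on the classifier represented in B.1 and B.2 imply by Proposition B.3 that $\frac{w_1^T t_2}{\|w_1\|\|t_2\|} \ge (2\alpha_1^2 - 1)$ and $\frac{w_2^T t_2}{\|w_2\|\|t_2\|} \ge (2\alpha_1^2 - 1)$. Now, applying Proposition B.1 (part 2) and simplifying, we get

$$\|w_1 - w_2\| \le 4\sqrt{\alpha_1^2(1 - \alpha_1^2)}. \qquad \text{(B.8)}$$

Applying the same analysis to the constraints B.3 and B.4, we get

$$\|w_1 - w_2\| \le 4\sqrt{\alpha_2^2(1 - \alpha_2^2)}. \qquad \text{(B.9)}$$

Combining B.7, B.8, and B.9, we get

$$\left\|\sum S_0 - \sum(S - S_0)\right\| \ge \frac{|S|\gamma}{2\sqrt{\alpha^2(1 - \alpha^2)}} \qquad \text{(B.10)}$$

with $\alpha$ as defined in the statement of the theorem.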

Taylor and Bartlett's (1998) Lemma 1.3 proves, using the probabilistic method, that some $S_0 \subseteq S$ satisfies

$$\left\|\sum S_0 - \sum(S - S_0)\right\| \le \sqrt{|S|}\,R. \qquad \text{(B.11)}$$

Combining B.10 and B.11 yields $|S| \le \frac{4R^2(\alpha^2(1-\alpha^2))}{\gamma^2}$. □
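
The two algebraic steps that close the proof can be checked numerically. The sketch below assumes the final bound is read as |S| <= 4 R^2 alpha^2 (1 - alpha^2) / gamma^2, with R the radius appearing in (B.11). It verifies that applying Proposition B.1 (part 2) to a cosine lower bound of 2 alpha^2 - 1 gives the simplified form 4 sqrt(alpha^2 (1 - alpha^2)), and then evaluates the resulting bound. The function names and the sample values of R, gamma and alpha are illustrative assumptions, not quantities taken from the experiments.

```python
import math


def distance_bound(alpha):
    """Proposition B.1 (part 2) applied with cosine lower bound c = 2 alpha^2 - 1."""
    c = 2.0 * alpha ** 2 - 1.0
    return 2.0 * math.sqrt(max(0.0, 1.0 - c * c))


def simplified_bound(alpha):
    """The simplified form 4 sqrt(alpha^2 (1 - alpha^2)) used in (B.8) and (B.9)."""
    return 4.0 * math.sqrt(alpha ** 2 * (1.0 - alpha ** 2))


def shatter_bound(R, gamma, alpha):
    """Upper bound on |S|, assuming the form 4 R^2 alpha^2 (1 - alpha^2) / gamma^2."""
    return 4.0 * R ** 2 * alpha ** 2 * (1.0 - alpha ** 2) / gamma ** 2


for alpha in (0.75, 0.9, 0.99):
    gap = abs(distance_bound(alpha) - simplified_bound(alpha))
    print(f"alpha={alpha}: identity gap={gap:.2e}, "
          f"bound on |S|={shatter_bound(1.0, 0.1, alpha):.2f}")
```

Under these assumptions the bound tightens as alpha grows toward 1, that is, as the prior constrains the hyperplane normal to a narrower cone.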